By Daniel Perruchoud and George Rowlands
This notebook explores the Cleantech domain using a Kaggle dataset of nearly 10,000 news articles centered on the energy sector. We start with data exploration and text preprocessing, and finish by building a Retrieval-Augmented Generation (RAG) pipeline. This approach lets a Large Language Model (LLM) intelligently answer user queries by drawing on the knowledge in our curated news articles.
Fine-tuning an LLM can be resource-intensive and inflexible; RAG offers a compelling alternative. It uses semantic search to pinpoint the sections of our news articles that are most relevant to a user's question. These retrieved sections are then provided to the LLM as context, enabling it to deliver informed, grounded responses.
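The retrieve-then-generate flow described above can be sketched in a few lines. This is a simplified stand-in, not the actual pipeline built later in this notebook: `retriever` and `llm` are placeholder callables.

```python
# Minimal sketch of the RAG flow: retrieve relevant chunks, then ask the LLM.
# `retriever` and `llm` are stand-ins for the real components built below.
def rag_answer(question: str, retriever, llm) -> str:
    chunks = retriever(question)              # semantic search over article chunks
    context = "\n\n".join(chunks)             # concatenate the retrieved sections
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                        # the LLM answers grounded in context
```

The rest of the notebook replaces each placeholder with a real component: a vector store for retrieval and GPT-4o as the LLM.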

To run this notebook, we recommend downloading the provided GitHub repository and opening the notebook in Google Colab.
At the start of the notebook, a data.zip file is downloaded from Google Drive and unzipped. It provides files containing checkpoints for all of the expensive processing steps, such as chunking, generating embeddings, and evaluating the pipeline with an LLM as a judge. This saves you both money and a lot of time.
If you can't or don't want to run this notebook you can also view the completed notebook by opening the cleantech_rag.html file in your browser.
Throughout this notebook, we'll examine the inner workings of RAG pipelines in detail.
Questions or Issues? We're Here to Help!
If you encounter any roadblocks or have questions, please don't hesitate to reach out to George Rowlands.
An OpenAI API key is used for the LLM calls throughout this notebook. Enter your key in the cell below:
%%writefile .env
OPENAI_API_KEY=ENTER_HERE
Overwriting .env
After executing the above cell, you should restart the kernel/runtime to ensure the key is properly set.
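After the restart, you can optionally verify that the key is visible to the process. The helper below is our own small sanity check, not part of the original pipeline; `load_dotenv()` (called later in the notebook) is what reads `.env` into the environment.

```python
import os

# Optional sanity check: after load_dotenv() runs, the key should be set and
# no longer equal to the ENTER_HERE placeholder written to .env above.
def openai_key_present() -> bool:
    key = os.getenv("OPENAI_API_KEY")
    return key is not None and key != "ENTER_HERE"
```

If this returns False after a kernel restart, re-check the contents of your `.env` file.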
%%writefile requirements.txt
chromadb==0.5.0
datasets==2.19.1
gdown==5.2.0
kaggle==1.6.1
langchain==0.2.0
langchain-community==0.2.0
langchain-experimental==0.0.59
langchain-openai==0.1.7
langdetect==1.0.9
lorem-text==2.1
nbformat>=4.2.0
plotly==5.22.0
pretty-jupyter==1.0
ragas==0.1.8
seaborn==0.13.2
sentence-transformers==3.0.0
spacy>=3.7
textstat==0.7.3
umap-learn==0.5.5
Overwriting requirements.txt
%pip install torch==2.3.0 --quiet --index-url https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.
%pip install -r ./requirements.txt --quiet
Note: you may need to restart the kernel to use updated packages.
import json
import os
import warnings
import zipfile
from collections import Counter
from pathlib import Path
from typing import Dict, List
import chromadb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import torch
from chromadb import Collection, Documents, EmbeddingFunction, Embeddings
from datasets import Dataset
from dotenv import load_dotenv
from langdetect import detect
from lorem_text import lorem
from ragas import RunConfig, evaluate
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy, answer_correctness)
from spacy.lang.en import English
from textstat import flesch_reading_ease
from tqdm import tqdm
import umap
from langchain.chains.base import Chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, VectorStore
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
load_dotenv()
warnings.filterwarnings("ignore")
By removing the first line (`%%script echo skipping`) from the two cells below, you can download the dataset, the chunks, their embeddings, and the evaluation results from our Google Drive. This will save you time and money.
%%script echo skipping
!gdown 1MoT_s_Zk4dzRRy7E7Va5ZuTROIOI1FfZ
Couldn't find program: 'echo'
%%script echo skipping
with zipfile.ZipFile("data.zip", "r") as zip_file:
zip_file.extractall()
Couldn't find program: 'echo'
To make sure our OpenAI key works, we test it by generating a response from GPT-4o, the model we will later use in our RAG pipeline. Try different prompts or questions to see how the model responds.
llm = ChatOpenAI(model="gpt-4o")
question_prompt = ChatPromptTemplate.from_template(
"Answer the following question: {question}")
question_chain = question_prompt | llm | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The question about the meaning of life has been a central philosophical and existential inquiry for centuries, with various interpretations and answers depending on cultural, religious, philosophical, and individual perspectives. Here are a few approaches to consider:\n\n1. **Philosophical Perspective**: Many philosophers have explored this question. For instance, existentialists like Jean-Paul Sartre argue that life has no inherent meaning and it\'s up to individuals to create their own purpose.\n\n2. **Religious Perspective**: Different religions offer varied interpretations. For example, in Christianity, the meaning of life is often seen as living in accordance with God\'s will and seeking salvation. In Buddhism, it involves reaching enlightenment and escaping the cycle of rebirth.\n\n3. **Scientific Perspective**: From a scientific viewpoint, life can be seen as a process of evolution and survival. The "meaning" might be interpreted as the continuation and propagation of life through reproduction and adaptation.\n\n4. **Personal Perspective**: Many people find meaning through personal fulfillment, relationships, achievements, and contributing to the well-being of others. This is often subjective and varies greatly among individuals.\n\nUltimately, the meaning of life might be a combination of these perspectives, and it often depends on personal beliefs, values, and experiences.'
We will be exploring the Cleantech Media Dataset. If you opened this notebook as recommended, via the provided GitHub repository in Google Colab, you don't need to download the dataset; it should already be under data/bronze. Otherwise, you can either manually download it into a data/bronze folder or follow the steps below.
We will be using the Kaggle API to download the data.
To use the Kaggle API you will need a Kaggle account. If you don't already have one, sign up for a Kaggle account at https://www.kaggle.com. When you are logged in, go to the 'Settings' tab of your user profile https://www.kaggle.com/settings and select 'Create New Token'. This will trigger the download of kaggle.json, a file containing your API credentials.
You can then add your Kaggle username and key from the kaggle.json.
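As an alternative to placing kaggle.json on disk, the kaggle package also reads credentials from the KAGGLE_USERNAME and KAGGLE_KEY environment variables, which is convenient in Colab. A sketch (the placeholder values mirror the cell below):

```python
import os

# The Kaggle CLI picks up credentials from these environment variables,
# so no kaggle.json file is needed. Replace the placeholders with the
# values from your downloaded kaggle.json.
kaggle_user = "XXXXXXXXXXXXXXXX"
kaggle_key = "XXXXXXXXXXXXXXXX"
os.environ["KAGGLE_USERNAME"] = kaggle_user
os.environ["KAGGLE_KEY"] = kaggle_key
```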
data_folder = Path("./data")
if not data_folder.exists():
data_folder.mkdir()
bronze_folder = data_folder / "bronze"
if not bronze_folder.exists():
bronze_folder.mkdir()
%%script echo skipping
kaggle_user = "XXXXXXXXXXXXXXXX"
kaggle_key = "XXXXXXXXXXXXXXXX"
Couldn't find program: 'echo'
%%script echo skipping
os.system(f"kaggle datasets download -d jannalipenkova/cleantech-media-dataset -p {bronze_folder}")
Couldn't find program: 'echo'
%%script echo skipping
with zipfile.ZipFile(bronze_folder / "cleantech-media-dataset.zip", "r") as zip_file:
zip_file.extractall(bronze_folder)
Couldn't find program: 'echo'
We now load and inspect both the Cleantech Media Dataset and the gold-standard evaluation data provided by our subject matter expert, Janna Lipenkova.
articles_df = pd.read_csv(
bronze_folder / "cleantech_media_dataset_v2_2024-02-23.csv",
encoding='utf-8', index_col=0)
articles_df.head()
| title | date | author | content | domain | url | |
|---|---|---|---|---|---|---|
| 1280 | Qatar to Slash Emissions as LNG Expansion Adva... | 2021-01-13 | NaN | ["Qatar Petroleum ( QP) is targeting aggressiv... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1281 | India Launches Its First 700 MW PHWR | 2021-01-15 | NaN | ["β’ Nuclear Power Corp. of India Ltd. ( NPCIL)... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1283 | New Chapter for US-China Energy Trade | 2021-01-20 | NaN | ["New US President Joe Biden took office this ... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1284 | Japan: Slow Restarts Cast Doubt on 2030 Energy... | 2021-01-22 | NaN | ["The slow pace of Japanese reactor restarts c... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1285 | NYC Pension Funds to Divest Fossil Fuel Shares | 2021-01-25 | NaN | ["Two of New York City's largest pension funds... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
human_eval_df = pd.read_csv(
bronze_folder / "cleantech_rag_evaluation_data_2024-02-23.csv",
encoding='utf-8', index_col=0)
human_eval_df.head()
| question_id | question | relevant_chunk | article_url | |
|---|---|---|---|---|
| example_id | ||||
| 1 | 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 3 | 2 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | https://www.pv-magazine.com/2023/02/02/europea... |
| 4 | 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 5 | 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | https://cleantechnica.com/2023/05/08/general-m... |
As the saying goes, "garbage in, garbage out." In the realm of machine learning, the quality of our outputs hinges on the quality of our inputs. This section delves into the essential processes of Exploratory Data Analysis (EDA) and data preprocessing. Through EDA, we'll illuminate the characteristics, patterns, and potential quirks residing within our cleantech news article dataset. Preprocessing will ensure our data is cleansed, structured, and prepared to be effectively utilized by the RAG pipeline, laying the foundation for high-quality results.
Let's start by gaining an overview of the dataset's features (columns).
articles_df.describe()
| title | date | author | content | domain | url | |
|---|---|---|---|---|---|---|
| count | 9593 | 9593 | 31 | 9593 | 9593 | 9593 |
| unique | 9569 | 967 | 7 | 9588 | 19 | 9593 |
| top | Cleantech Thought Leaders Series | 2023-05-04 | Michael Holder | ['Geopolitics as much as price or quality will... | cleantechnica | https://www.energyintel.com/0000017b-a7dc-de4c... |
| freq | 5 | 427 | 8 | 2 | 1861 | 1 |
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9593 entries, 1280 to 81816
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ------
 0   title    9593 non-null   object
 1   date     9593 non-null   object
 2   author   31 non-null     object
 3   content  9593 non-null   object
 4   domain   9593 non-null   object
 5   url      9593 non-null   object
dtypes: object(6)
memory usage: 524.6+ KB
Our initial exploration reveals that the "author" column only contains data for 31 out of 9593 articles. Since this offers minimal information gain, we can remove this feature.
We've also observed that some titles and content entries appear to be non-unique. This might necessitate identifying and removing duplicate entries.
On a positive note, the article URLs are all unique, potentially serving as suitable unique identifiers for the data.
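The uniqueness claim is easy to verify with pandas. A sketch on a toy frame; in the notebook, the equivalent check is simply `articles_df["url"].is_unique`:

```python
import pandas as pd

# Toy frame standing in for articles_df; is_unique confirms the column
# could serve as a unique identifier.
toy = pd.DataFrame({"url": ["https://ex.com/a", "https://ex.com/b", "https://ex.com/c"]})
urls_unique = toy["url"].is_unique
print(urls_unique)
```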
articles_df = articles_df.drop(columns=["author"])
The dataset helpfully provides the domain names extracted from the article URLs. These domains essentially represent the publishers of the news articles. Let's analyze the distribution of publishers and see how many articles each publisher has contributed.
domain_counts = articles_df["domain"].value_counts()
domain_counts
domain
cleantechnica            1861
azocleantech             1627
pv-magazine              1206
energyvoice              1017
solarindustrymag          673
naturalgasintel           658
thinkgeoenergy            645
rechargenews              559
solarpowerworldonline     505
energyintel               234
pv-tech                   232
businessgreen             158
greenprophet               80
ecofriend                  38
solarpowerportal.co        34
eurosolar                  28
decarbxpo                  19
solarquarter               17
indorenergy                 2
Name: count, dtype: int64
A visualization helps us to understand the skew in the data.
barplot = sns.barplot(
x=domain_counts.values,
y=domain_counts.index,
hue=domain_counts.index
)
barplot.set_title('Article Counts by Domain')
barplot.set_xlabel('Article Count')
barplot.set_ylabel('Domain')
plt.show()
Our exploration of article domains reveals a skewed distribution. Publishers like cleantechnica have a significantly higher representation (1861 articles), while others like indorenergy have minimal contributions (2 articles). If we proceed with sampling this data, this imbalance should be taken into account. Stratified sampling is the recommended approach to ensure a representative sample across publishers.
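Stratified sampling by domain is a one-liner with pandas. A sketch on a toy frame (in the notebook, the grouping column would be articles_df["domain"]):

```python
import pandas as pd

# Draw the same fraction from every domain so small publishers stay
# represented in proportion to their size.
toy = pd.DataFrame({
    "domain": ["cleantechnica"] * 8 + ["indorenergy"] * 2,
    "title": [f"article {i}" for i in range(10)],
})
stratified = toy.groupby("domain").sample(frac=0.5, random_state=42)
```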
Each article within the dataset is accompanied by a publication date. Let's delve into the temporal range of these articles and investigate any noteworthy patterns in publication trends.
# plot the amount of articles over time
articles_df["date"] = pd.to_datetime(articles_df["date"])
time_df = articles_df.groupby("date").size().reset_index()
time_df.columns = ["date","count"]
time_df.describe()
| date | count | |
|---|---|---|
| count | 967 | 967.000000 |
| mean | 2022-06-01 19:11:06.390899456 | 9.920372 |
| min | 2021-01-01 00:00:00 | 1.000000 |
| 25% | 2021-09-11 12:00:00 | 4.000000 |
| 50% | 2022-06-06 00:00:00 | 9.000000 |
| 75% | 2023-02-14 12:00:00 | 13.000000 |
| max | 2023-12-05 00:00:00 | 427.000000 |
| std | NaN | 15.206340 |
sns.lineplot(data=time_df, x="date", y="count")
plt.title("Article Count Over Time")
plt.xlabel("Date")
plt.xticks(rotation=90)
plt.ylabel("Article Count")
# add a line for the average
avg_count = time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()
While the daily article count appears consistent overall, a significant outlier of 427 articles disrupts the pattern on 2023-05-04. The cause of this outlier is undetermined, but it could be the date the data was scraped, used as a default value for missing dates. Since the publication date is not crucial for the RAG pipeline, we can remove it.
articles_df = articles_df.drop(columns=["date"])
As noted in our initial exploration, some articles share identical titles. Here, we'll focus on identifying and handling these duplicate titles to ensure a clean and consistent dataset for our RAG pipeline.
sns.histplot(articles_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = articles_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
articles_df["title"].duplicated().sum()
24
duplicate_titles = articles_df[articles_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
| title | content | domain | url | |
|---|---|---|---|---|
| 6654 | Aberdeenβ s NZTC plans national centre for geo... | ['Aberdeenβ s NZTC is planning a national cent... | energyvoice | https://www.energyvoice.com/renewables-energy-... |
| 6660 | Aberdeenβ s NZTC plans national centre for geo... | ['Aberdeenβ s NZTC is planning a national cent... | energyvoice | https://sgvoice.energyvoice.com/strategy/techn... |
| 38593 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cross |
| 38599 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38596 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38598 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38597 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 6704 | BEIS mulls ringfenced CfD support for geotherm... | ['Ministers are considering whether geothermal... | energyvoice | https://sgvoice.energyvoice.com/policy/21121/b... |
| 6702 | BEIS mulls ringfenced CfD support for geotherm... | ['Ministers are considering whether geothermal... | energyvoice | https://www.energyvoice.com/renewables-energy-... |
| 37040 | Cleantech Insights from Industry Series | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/Insights.aspx?page=2 |
duplicate_titles["content"].duplicated().sum()
0
Our exploration identified 24 rows whose title already appears earlier in the dataset, for example "About David J. Cross". Interestingly, while the titles are identical, the content strings themselves are all unique.
There are some additional observations worth investigating, so let's inspect a pair of these duplicates more closely.
def wrap_text(text: str, char_per_line: int = 100) -> str:
    """For better readability, wrap the text at the last space before char_per_line."""
    if len(text) < char_per_line:
        return text
    # Break at the last space within the first char_per_line characters,
    # then recurse on the remainder
    head = text[:char_per_line].rsplit(' ', 1)[0]
    tail = text[len(head) + 1:]
    return head + '\n' + wrap_text(tail, char_per_line)
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["content"]))
Aberdeenβ s NZTC plans national centre for geothermal energy ['Aberdeenβ s NZTC is planning a national centre to accelerate geothermal energy in the UK and become the β go-to β hub globally for the renewable technology.', 'Calum Watson, senior project engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTCβ s new β National Geothermal Innovation Centre β would develop technology and help create β bespoke regulation β for geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said geothermal could account for 20% of Britainβ s energy mix, slashing carbon emissions in the process.', 'Geothermal is a burgeoning technology β which has been picked up in some countries like Iceland and the Philippines β which harnesses heat in the subsurface of the earth to generate electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and drilling.', 'However a report published this week by trade body Offshore Energies UK said there are 2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade β which Mr Watson described as a β massive opportunity β for geothermal', 'Based at a β north-east location β, the new hub would be the β go to centre globally for geothermal technology challenges but, crucially, would be world-leading in supporting government, and creating legislation and best practice for geothermal β.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He said it would be achieved through a β partner-led roadmap β akin to the NZTC itself β which is funded with Β£180m of UK and Scottish Government funding β and ultimately be powered by geothermal energy.', 'The national base would comprise a β solution centre β to scale up technologies from 
pilot stage.', 'It would also have a knowledge hub to share learnings and an β accelerator programme β to fund start-ups.', 'The NZTC has already dipped its toe into the tech β supporting a β first of its kind β test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for oil and gas workers to transfer to the sustainable technology.', 'β ( By 2030) we want the centre to have delivered geothermal energy, accounting for 5% of the UKβ s energy mix and on route for 20% by 2050.', 'β We would have multiple demonstrators successfully delivered to showcase and educate and, long term, the center will be run on geothermal energy.']
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["content"]))
Aberdeenβ s NZTC plans national centre for geothermal energy ['Aberdeenβ s NZTC is planning a national centre to accelerate geothermal energy in the UK and become the β go-to β hub globally for the renewable technology.', 'Calum Watson, senior project engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTCβ s new β National Geothermal Innovation Centre β would develop technology and help create β bespoke regulation β for geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said geothermal could account for 20% of Britainβ s energy mix, slashing carbon emissions in the process.', 'Geothermal is a burgeoning technology β which has been picked up in some countries like Iceland and the Philippines β which harnesses heat in the subsurface of the earth to generate electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and drilling.', 'However a report published this week by trade body Offshore Energies UK said there are 2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade β which Mr Watson described as a β massive opportunity β for geothermal', 'Based at a β north-east location β, the new hub would be the β go to centre globally for geothermal technology challenges but, crucially, would be world-leading in supporting government, and creating legislation and best practice for geothermal β.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He said it would be achieved through a β partner-led roadmap β akin to the NZTC itself β which is funded with Β£180m of UK and Scottish Government funding β and ultimately be powered by geothermal energy.', 'The national base would comprise a β solution centre β to scale up technologies from 
pilot stage.', 'It would also have a knowledge hub to share learnings and an β accelerator programme β to fund start-ups.', 'The NZTC has already dipped its toe into the tech β supporting a β first of its kind β test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for oil and gas workers to transfer to the sustainable technology.', 'β ( By 2030) we want the centre to have delivered geothermal energy, accounting for 5% of the UKβ s energy mix and on route for 20% by 2050.']
Our analysis suggests potential redundancy within certain articles: in some cases, one version of an article appears to be simply another version with an additional sentence appended at the end.
Let's take a closer look at these "energyvoice" articles and how their contents start, and see if we can eliminate these redundancies.
energyvoice_articles = articles_df[articles_df["domain"].str.contains("energyvoice")]
energyvoice_articles.content.map(lambda x: x[:50]).value_counts()
content
['', '', 'The Megawatt Hour is the latest podcast 6
['A group of trade associations from across the en 3
['Two years after the Amazon Pledge Fund invested 3
['The latest analysis shows that capital flows tow 2
['Macquarie Group is betting the North Sea β engin 2
..
['Now more than ever β in terms of cost and the im 1
['Scientists have hailed a helium discovery which 1
['Marine equipment fabrication and rental speciali 1
['The Russian powers behind oil explorers Exillon 1
['Aberdeen-headquartered Repsol Sinopec Resources 1
Name: count, Length: 980, dtype: int64
def remove_prefix_articles(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    """
    Remove articles that are prefixes of longer articles. Takes O(n^2) time.

    Two articles are considered to share a prefix if their first {prefix_len}
    characters are identical. If an article is a prefix of a longer article
    with the same title, we remove it; if the titles differ, we keep both.
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    df["char_len"] = df["content"].map(len)
    df = df.sort_values(by='char_len', ascending=True).reset_index(drop=True)
    # Collect the articles that are not prefixes of any longer article
    non_prefix_articles = []
    for i, row in df.iterrows():
        is_prefix = False
        content_i = row['content'][:prefix_len]
        title_i = row['title']
        # Only longer articles (later in the sorted order) can contain row i
        for j in range(i + 1, len(df)):
            content_j = df.at[j, 'content'][:prefix_len]
            title_j = df.at[j, 'title']
            if content_i == content_j:
                # If the prefix matches but the titles differ, we keep both
                if title_i != title_j:
                    continue
                else:
                    is_prefix = True
                    break
        if not is_prefix:
            non_prefix_articles.append(row)
    print(f"Removed {len(df) - len(non_prefix_articles)} prefix articles")
    return pd.DataFrame(non_prefix_articles)
energyvoice_articles = remove_prefix_articles(energyvoice_articles)
energyvoice_articles.content.map(lambda x: x[:100]).value_counts()
Removed 11 prefix articles
content
['', '', 'The Megawatt Hour is the latest podcast boxset brought to you by Energy Voice Out Loud in 6
['Two years after the Amazon Pledge Fund invested in Hippo Harvest, the company is selling its first 3
['A group of trade associations from across the energy sector have written to the Chancellor urging 3
['Global Port Services has confirmed the award of multiple contracts in support of the Seagreen wind 2
['DNV report shows Jotunβ s Baltoflake solution offers beyond 30 yearsβ protection for offshore asse 2
..
['The deal volume for renewable energy assets in Asia more than tripled to $ 13.6 billion in 2021, a 1
['Several young energy professionals have undertaken a voyage across Scotland to spotlight the count 1
['A UK-backed research group unveiled a design for a liquid hydrogen-powered airliner theoretically 1
['UK-listed Pharos Energy is excited about its upcoming Vietnam activities with a 3D seismic shoot l 1
['With the greatest and most urgent energy transition in human history accelerating, the quest for n 1
Name: count, Length: 981, dtype: int64
There still seems to be some redundancy, but we did manage to remove 11 duplicates.
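As an aside, the quadratic scan above can be avoided: since two articles are treated as redundant only when they share both the title and the content prefix, grouping on that pair and keeping the longest member gives the same effect in roughly linear time. A hedged sketch (the function name is ours, not from the notebook):

```python
import pandas as pd

def remove_prefix_articles_fast(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    # Articles sharing the same title and the same first prefix_len characters
    # are treated as redundant; keep only the longest one from each group.
    df = df.copy()
    df["_prefix"] = df["content"].str[:prefix_len]
    df["_len"] = df["content"].str.len()
    keep = df.sort_values("_len").drop_duplicates(subset=["title", "_prefix"], keep="last")
    return keep.drop(columns=["_prefix", "_len"])
```

Note this only catches articles whose extra text comes after the shared prefix, exactly like the O(n^2) version.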
Having explored various aspects of our dataset, we now turn our attention to the heart of the matter: the article content itself. This section will delve into the analysis and preprocessing techniques we'll employ to ensure the content is high-quality and effectively utilized by our RAG pipeline.
We start with a visual inspection of the article content.
np.random.seed(7)
random_sample_id = np.random.choice(articles_df.index)
print(wrap_text(articles_df.loc[random_sample_id, "content"]))
['Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries to customers across the United States.', 'The strategic relationship with Lumio will amplify the impact and distribution of Enphase systems, providing homeowners more access to reliable, sustainable and grid-independent power sources, the company says.', 'β We are excited about Enphaseβ s full suite of products β including microinverters, batteries and EV chargers β that can provide our customers best-in-class home energy management solutions, β says Greg Butterfield, CEO at Lumio. β Additionally, the Enphase digital platform, from lead generation to permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and reduce costs. β', 'For homeowners who want battery backup, there are no sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a home energy system β switching to sunlight-only after prolonged grid outages that may result in a fully depleted battery. This eliminates the need for a manual restart of the system and gives homeowners greater assurance of energy resilience.', 'β This strategic relationship with Enphase makes it easier for Lumioβ s customers to take control of their power production, power consumption, and increase the security and reliability of their familyβ s power supply, β adds David Schonberg, senior vice president of energy partnerships at Lumio.', 'Solar Industry offers industry participants probing, comprehensive assessments of the technology, tools and trends that are driving this dynamic energy sector. From raw materials straight through to end-user applications, we capture and analyze the critical details that help professionals stay current and navigate the solar market.', 'Β© Copyright Zackin Publications Inc. 
All Rights Reserved.']
Our initial examination reveals that article content is currently stored as a stringified list of strings. To gain a deeper understanding and facilitate preprocessing, we'll join these lists into cohesive article texts.
import ast

# ast.literal_eval safely parses the stringified Python lists (safer than eval)
articles_df['article'] = articles_df['content'].apply(lambda x: ' '.join(ast.literal_eval(x)))
print(wrap_text(articles_df.loc[random_sample_id, "article"]))
Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries to customers across the United States. The strategic relationship with Lumio will amplify the impact and distribution of Enphase systems, providing homeowners more access to reliable, sustainable and grid-independent power sources, the company says. β We are excited about Enphaseβ s full suite of products β including microinverters, batteries and EV chargers β that can provide our customers best-in-class home energy management solutions, β says Greg Butterfield, CEO at Lumio. β Additionally, the Enphase digital platform, from lead generation to permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and reduce costs. β For homeowners who want battery backup, there are no sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a home energy system β switching to sunlight-only after prolonged grid outages that may result in a fully depleted battery. This eliminates the need for a manual restart of the system and gives homeowners greater assurance of energy resilience. β This strategic relationship with Enphase makes it easier for Lumioβ s customers to take control of their power production, power consumption, and increase the security and reliability of their familyβ s power supply, β adds David Schonberg, senior vice president of energy partnerships at Lumio. Solar Industry offers industry participants probing, comprehensive assessments of the technology, tools and trends that are driving this dynamic energy sector. From raw materials straight through to end-user applications, we capture and analyze the critical details that help professionals stay current and navigate the solar market. Β© Copyright Zackin Publications Inc. All Rights Reserved.
articles_df["article"].duplicated().sum()
5
duplicate_articles = articles_df[articles_df["article"].duplicated(keep=False)].sort_values("article")
duplicate_articles
| title | content | domain | url | article | |
|---|---|---|---|---|---|
| 78215 | China's wind giants are chasing global growth:... | ['Geopolitics as much as price or quality will... | rechargenews | https://www.rechargenews.com/wind/chinas-wind-... | Geopolitics as much as price or quality will d... |
| 78216 | Why geopolitics will set the limits of China's... | ['Geopolitics as much as price or quality will... | rechargenews | https://www.rechargenews.com/wind/why-geopolit... | Geopolitics as much as price or quality will d... |
| 80067 | Sodium-ion battery production capacity to grow... | ['Global demand for sodium-ion batteries is ex... | pv-magazine | https://www.pv-magazine.com/2023/07/17/sodium-... | Global demand for sodium-ion batteries is expe... |
| 80073 | Sodium-ion battery fleet to grow to 10 GWh by ... | ['Global demand for sodium-ion batteries is ex... | pv-magazine | https://www.pv-magazine.com/2023/07/17/sodium-... | Global demand for sodium-ion batteries is expe... |
| 6685 | Indonesia seeks investors for giant geothermal... | ['Indonesia, home to the worldβ s largest geot... | energyvoice | https://www.energyvoice.com/oilandgas/467719/i... | Indonesia, home to the worldβ s largest geothe... |
| 6689 | Indonesia seeks investors for giant geothermal... | ['Indonesia, home to the worldβ s largest geot... | energyvoice | https://sgvoice.energyvoice.com/investing/2002... | Indonesia, home to the worldβ s largest geothe... |
| 78225 | Quest for endless green energy from Earth's co... | ['One of Japanβ s largest utility groups Chubu... | rechargenews | https://www.rechargenews.com/energy-transition... | One of Japanβ s largest utility groups Chubu E... |
| 78227 | Limitless green energy from Earth's core quest... | ['One of Japanβ s largest utility groups Chubu... | rechargenews | https://www.rechargenews.com/news/2-1-1487279 | One of Japanβ s largest utility groups Chubu E... |
| 78210 | Portugal energy transition plan targets massiv... | ['Portugal has more than doubled its 2030 goal... | rechargenews | https://www.rechargenews.com/energy-transition... | Portugal has more than doubled its 2030 goals ... |
| 78212 | Wind, hydrogen and solar fused in Portugal's p... | ['Portugal has more than doubled its 2030 goal... | rechargenews | https://www.rechargenews.com/energy-transition... | Portugal has more than doubled its 2030 goals ... |
Our analysis uncovers additional insights regarding content duplication. We observe cases where seemingly identical articles are reposted on the same domain under different titles (excluding the "sgvoice.energyvoice.com" vs. "energyvoice.com" scenario previously addressed). We will deliberately keep these duplicates: their contents are the same, but their titles differ.
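To make this policy concrete, here is a toy sketch (hypothetical data) showing how deduplicating on the (title, article) pair drops only true reposts while keeping content-only duplicates with differing titles:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are exact reposts, row 1 shares
# content but carries a different (potentially informative) title.
df = pd.DataFrame({
    "title": [
        "Sodium-ion fleet to grow",
        "Sodium-ion capacity to grow",
        "Sodium-ion fleet to grow",
    ],
    "article": ["Global demand is expected to rise."] * 3,
})

# Deduplicating on both columns removes only the exact repost (row 2);
# the title-variant duplicate (row 1) survives.
deduped = df.drop_duplicates(subset=["title", "article"])
print(len(deduped))  # 2
```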
Importance of Titles
We keep these duplicate articles because titles can hold information relevant to our RAG pipeline. Consider a scenario where a user query uses an abbreviation that appears only in an article's title, while the content always spells out the full term. To bridge this gap, we'll prepend titles to the article content during preprocessing. This ensures that the retrieval process considers not only the content itself but also the potentially informative titles.
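Prepending titles can be sketched in one line (toy example, assuming the title and article columns shown above):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["EU's GDIP explained"],
    "article": ["The Green Deal Industrial Plan is a bid by the EU..."],
})

# Prepend the title so retrieval can match abbreviations that
# appear only in the title, not in the body.
df["article"] = df["title"].str.strip() + ". " + df["article"].str.strip()
print(df["article"].iloc[0])
# EU's GDIP explained. The Green Deal Industrial Plan is a bid by the EU...
```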
Next Step
As previously noted, some articles exhibit standardized introductions, possibly artifacts of the data scraping process. We'll develop appropriate techniques to handle these introductions during preprocessing, ensuring they don't hinder the effectiveness of our RAG pipeline.
articles_df.article.map(lambda x: x[:50]).value_counts()
article
By clicking `` Allow All '' you agree to the stori 1627
Sign in to get the best natural gas news and data. 658
window.dojoRequire ( [ `` mojo/signup-forms/Loader 52
None of these red flags by themselves make a compa 19
Volkswagen ID.4 sales were up 254% in the 1st quar 14
...
You want to invest in renewable energy or a better 1
The best way to deal with carbon is not to release 1
When there is deflation, the prices of goods in th 1
Stickers are excellent products to leverage in bot 1
Arevon Energy Inc. has closed financing on the Vik 1
Name: count, Length: 6765, dtype: int64
artifacts = [
"By clicking `` Allow All '' you agree to the sto",
"Sign in to get the best natural gas news and dat",
"window.dojoRequire ( [ `` mojo/signup-forms/Load"
]
for artifact in artifacts:
print(wrap_text(articles_df[articles_df.article.str.startswith(artifact)].article.iloc[0][:500]))
print()
By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site
navigation, analyse site usage and support us in providing free open access scientific content.
More info. Nel Hydrogen is committed to pushing the boundaries of science and continues to support
the research and development of new and innovative technologies. A group of leading researchers and
two employees of Proton Energy Systems, Inc., a subsidiary of Nel ASA ( Nel Hydrogen) have recently
published
Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily
emails. Your email address * Your password * Remember me Continue Reset password Featured Content
News & Data Services Client Support Bidweek Markets | Natural Gas Prices | NGI All News Access
Major fluctuations in the latest weather models resulted in big swings in natural gas bidweek
prices, with solid gains on the East Coast and out West. However, much of the countryβ s midsection
posted hefty
window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '':
'' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd
'', '' uniqueMethods '': true }) }) American consumers are more concerned about the planet than
steady economic growth, new report. Your company wants to be a part of this. What steps do you
take? Each company should create detailed reports that evaluate the environmental impact of the
business, num
def remove_scraping_artifacts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    text_artifacts = [
        "By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info.",
        "Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails. Your email address * Your password * Remember me Continue Reset password Featured Content News & Data Services Client Support"
    ]
    regex_artifacts = [
        r"window\.dojoRequire \( \[ .*\}\) \}\) "
    ]
    for pattern in text_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=False)
    for pattern in regex_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=True)
    return df
articles_df = remove_scraping_artifacts(articles_df, "article")
articles_df.article.map(lambda x: x[:50]).value_counts()
article
Daily GPI Energy Transition | Infrastructure | NG 38
Daily GPI E & P | NGI All News Access The U.S. na 36
Daily GPI Energy Transition | NGI All News Access 28
None of these red flags by themselves make a compa 19
Daily GPI Markets | Natural Gas Prices | NGI All 17
..
Award winning cleantech firm Aceleronβ s repairab 1
Generating safe, green energy is one thing but pr 1
Countries around the world need to move further a 1
The sun is arguably the most important renewable 1
Arevon Energy Inc. has closed financing on the Vik 1
Name: count, Length: 8749, dtype: int64
Our efforts have successfully eliminated a substantial portion of the scraping artifacts within the articles. However, some traces still persist, likely remnants of past website navigation structures. While removing these remaining artifacts could offer further refinement, it also presents a significant challenge. Therefore, we'll acknowledge this for now and move on to further preprocessing, such as filtering out articles that are not written in English.
articles_df["lang"] = articles_df["article"].map(detect)
articles_df["lang"].value_counts()
lang
en    9588
de       4
ru       1
Name: count, dtype: int64
Let's first inspect the articles flagged as non-English.
articles_df[articles_df["lang"] != "en"]
| title | content | domain | url | article | lang | |
|---|---|---|---|---|---|---|
| 8283 | International Energy Storage Conference ( IRES... | ['EUROSOLAR veranstaltet vom 16. bis 18. MΓ€rz ... | eurosolar | https://www.eurosolar.de/2021/01/26/internatio... | EUROSOLAR veranstaltet vom 16. bis 18. MΓ€rz 20... | de |
| 8304 | Open Letter to Presidents Putin, Biden, Zelens... | ['EUROSOLAR, the European Association for Rene... | eurosolar | https://www.eurosolar.de/sektionen/russland/ | EUROSOLAR, the European Association for Renewa... | ru |
| 8307 | Internationale Konferenz fΓΌr Energiespeicher m... | ['Die nun zu Ende gegangene β Internationale E... | eurosolar | https://www.eurosolar.de/2022/09/26/internatio... | Die nun zu Ende gegangene β Internationale Ern... | de |
| 8308 | Presentations, Poster and Photos of the IRES 2022 | ['Photos from the IRES ( Copyright EUROSOLAR e... | eurosolar | https://www.eurosolar.de/2022/10/20/presentati... | Photos from the IRES ( Copyright EUROSOLAR e.V... | de |
| 24652 | SMS group liefert Prozesstechnologie fΓΌr das e... | ['Β© SMS group liefert Prozesstechnologie fΓΌr d... | decarbxpo | https://www.decarbxpo.com/en/News_Media/Magazi... | Β© SMS group liefert Prozesstechnologie fΓΌr das... | de |
print(wrap_text(articles_df[articles_df["lang"] != "en"].iloc[1]["article"][1000:]))
suffering and misery for over a century, while distracting from the one common enemy threatening to consume all: accelerated fossil fueled climate heating. The Ukraineβ s EUROSOLAR section and its networks have long advocated a new age with renewable energy in Eastern Europe. Together with all of our other sections and members across the European continent, from Russia to the Netherlands, and from Turkey to Denmark, EUROSOLAR offers this Climate Peace Platform. Prof. Peter Droege, President of EUROSOLAR: β The time has come for Climate Peace Diplomacy, to confront everyoneβ s common enemy: advanced fossil climate destabilization. This is one of ten actions presented by EUROSOLAR as the main agenda of our time. β Dr. Brigitte Schmidt, Vice President and Board Member of EUROSOLAR Germany: β The time for renewable peace has come, part of our Regenerative Earth Decade program. It stands for rethinking and peaceful action for our common future.β Since its very foundation in 1988 EUROSOLAR has worked to end fossil fuel wars through the great switch to 100% renewable energy. In the words of Hermann Scheer ( 1944-2010), founder of EUROSOLAR: β Renewable energies build peaceβ. The age of fossil-nuclear threats must end, the existential focus must begin: www.earthdecade.org. EUROSOLAR also calls for a shift in thinking towards climate peace diplomacy that recognizes and combats fossil dependencies as humanityβ s greatest common enemy. https: //www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom cy/ ΠΡΠ΄ΠΊΡΠΈΡΠΈΠΉ Π»ΠΈΡΡ ΠΏΡΠ΅Π·ΠΈΠ΄Π΅Π½ΡΠ°ΠΌ ΠΡΡΡΠ½Ρ, ΠΠ°ΠΉΠ΄Π΅Π½, ΠΠ΅Π»Π΅Π½ΡΡΠΊΠΈΠΉ Ρ ΠΡΠΊΠ°ΡΠ΅Π½ΠΊΠΎ: Eurosolar, ΠΠ²ΡΠΎΠΏΠ΅ΠΉΡΡΠΊΠ° Π°ΡΠΎΡΡΠ°ΡΡΡ Π²ΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½ΠΎΡ Π΅Π½Π΅ΡΠ³Π΅ΡΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ Π΄ΠΎ Π½Π΅Π³Π°ΠΉΠ½ΠΎΠ³ΠΎ ΠΏΡΠΈΠΏΠΈΠ½Π΅Π½Π½Ρ Π²ΠΎΠ³Π½Ρ ΡΠ° ΠΏΠΎΡΡΡΠΉΠ½ΠΎΡ ΠΌΠΈΡΠ½ΠΎΡ ΡΠ³ΠΎΠ΄ΠΈ ΠΏΠΎ Π²ΡΡΠΉ Π‘Ρ ΡΠ΄Π½ΡΠΉ ΠΠ²ΡΠΎΠΏΡ, Π±Π΅ΡΡΡΠΈ ΡΡΠ°ΡΡΡ Ρ Π²ΡΠ΅ΡΡΠΎΡΠΎΠ½Π½ΡΠΉ ΠΊΠ»ΡΠΌΠ°ΡΠΈΡΠ½ΡΠΉ ΠΌΠΈΡΠ½ΡΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΡΡ. 
ΠΠ°ΠΏΠ°Π΄ ΡΠΎΡΡΠΉΡΡΠΊΠΈΡ Π²ΡΠΉΡΡΠΊΠΎΠ²ΠΈΡ Π½Π° ΡΠΊΡΠ°ΡΠ½ΡΡΠΊΠΈΠΉ Π½Π°ΡΠΎΠ΄ Ρ ΠΉΠΎΠ³ΠΎ ΡΡΡΠ΄ ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π±ΡΡΠΈ Π·Π°ΡΡΠ΄ΠΆΠ΅Π½ΠΈΠΉ Π½Π°ΠΉΡΡΡΡΡΡΡΠΈΠΌ ΡΠΈΠ½ΠΎΠΌ Ρ ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π½Π΅Π³Π°ΠΉΠ½ΠΎ ΠΏΡΠΈΠΏΠΈΠ½ΠΈΡΠΈΡΡ. ΠΡΡ ΠΊΡΠ°ΡΠ½ΠΈ, ΡΠΊΡ Π²ΠΈΠΊΠΎΡΠΈΡΡΠΎΠ²ΡΡΡΡ Π²ΡΠΉΡΡΠΊΠΎΠ²Ρ Π°Π»ΡΡΠ½ΡΠΈ Π΄Π»Ρ ΠΏΠΎΡΡΡΠΉΠ½ΠΎΠ³ΠΎ ΠΊΠΎΡΠΈΠ³ΡΠ²Π°Π½Π½Ρ ΡΡΠ΅Ρ ΡΠ½ΡΠ΅ΡΠ΅ΡΡΠ² Ρ ΠΏΠΎΡΡΡΠΉΠ½ΠΎ ΠΆΠΎΠΊΠ΅Ρ Π΄Π»Ρ ΡΠ°ΠΊΡΠΈΡΠ½ΠΈΡ Ρ ΡΡΡΠ°ΡΠ΅Π³ΡΡΠ½ΠΈΡ ΠΏΠ΅ΡΠ΅Π²Π°Π³, ΠΏΠΎΠ²ΠΈΠ½Π½Ρ ΠΏΡΠΈΠΏΠΈΠ½ΠΈΡΠΈ ΡΠ²ΠΎΡ Π΄Π΅ΡΡΠ°Π±ΡΠ»ΡΠ·ΡΡΡΡ ΠΏΡΠ°ΠΊΡΠΈΠΊΡ. ΠΡΡ ΡΡΠΎΡΠΎΠ½ΠΈ ΠΏΠΎΠ²ΠΈΠ½Π½Ρ ΠΏΡΠΎΠΊΠΈΠ½ΡΡΠΈΡΡ: ΠΌΠΈ Π½Π΅ ΡΡΠ»ΡΠΊΠΈ Π²ΡΡ Π΄ΠΈΠ²Π»ΡΠΌΠΎΡΡ Π² ΡΠ΄Π΅ΡΠ½Ρ ΠΏΡΡΡΠ²Ρ ΡΠ΅ΡΠ΅Π· ΡΡΠΈΠ²Π°Π»Ρ Π½Π΅Π²Π΄Π°Π»Ρ ΡΠΏΡΠΎΠ±ΠΈ ΡΠΎΠ·Π·Π±ΡΠΎΡΠ½Π½Ρ β ΠΏΠ»Π°Π½Π΅ΡΠ° ΡΠ°ΠΊΠΎΠΆ Π·Π½Π°Ρ ΠΎΠ΄ΠΈΡΡΡΡ Π² Π»Π΅ΡΠ°ΡΠ°Ρ Π½Π΅ΠΊΠΎΠ½ΡΡΠΎΠ»ΡΠΎΠ²Π°Π½ΠΎΡ ΠΊΠ»ΡΠΌΠ°ΡΠΈΡΠ½ΠΎΡ ΡΠΏΡΡΠ°Π»Ρ, ΡΠΊΠ° ΠΏΡΠ°ΠΊΡΠΈΡΠ½ΠΎ Π½Π°ΠΏΠ΅Π²Π½ΠΎ Π·ΡΠΎΠ±ΠΈΡΡ ΡΡ Π½Π΅ΠΏΡΠΈΠ΄Π°ΡΠ½ΠΎΡ Π΄Π»Ρ ΠΆΠΈΡΡΡ Π² ΡΡΠΎΠΌΡ ΠΏΠΎΠΊΠΎΠ»ΡΠ½Π½Ρ. Eurosolar, ΠΠ²ΡΠΎΠΏΠ΅ΠΉΡΡΠΊΠ° Π°ΡΠΎΡΡΠ°ΡΡΡ Π²ΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½ΠΎΡ Π΅Π½Π΅ΡΠ³Π΅ΡΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ Π΄ΠΎ ΠΏΠΎΠ²Π½ΠΎΠ³ΠΎ Ρ ΡΠ²ΠΈΠ΄ΠΊΠΎΠ³ΠΎ ΠΏΠ΅ΡΠ΅Ρ ΠΎΠ΄Ρ Π΄ΠΎ Π²ΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½ΠΎΡ Π΅Π½Π΅ΡΠ³Π΅ΡΠΈΠΊΠΈ, ΡΠΎΠ± ΠΏΠΎΠΊΠ»Π°ΡΡΠΈ ΠΊΡΠ°ΠΉ Π·Π°Π»Π΅ΠΆΠ½ΠΎΡΡΡ ΠΠ²ΡΠΎΠΏΠΈ ΡΠ° ΡΠ²ΡΡΡ Π²ΡΠ΄ Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΏΠ°Π»ΠΈΠ²Π°. Π¦Π΅ ΠΏΡΠΈΠ·Π²Π΅Π»ΠΎ Π΄ΠΎ Π½Π΅ΡΠΊΡΠ½ΡΠ΅Π½Π½ΠΎΡ Π²ΡΠΉΠ½ΠΈ, Π½Π΅Π²ΠΈΠΌΠΎΠ²Π½ΠΈΡ ΡΡΡΠ°ΠΆΠ΄Π°Π½Ρ Ρ ΡΡΡΠ°ΠΆΠ΄Π°Π½Ρ ΠΏΡΠΎΡΡΠ³ΠΎΠΌ Π±ΡΠ»ΡΡ Π½ΡΠΆ ΡΡΠΎΠ»ΡΡΡΡ, Π²ΡΠ΄Π²ΠΎΠ»ΡΠΊΠ°ΡΡΠΈ Π²ΡΠ΄ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΡΠΏΡΠ»ΡΠ½ΠΎΠ³ΠΎ Π²ΠΎΡΠΎΠ³Π°, ΡΠΊΠΈΠΉ ΠΏΠΎΠ³ΡΠΎΠΆΡΡ ΡΠΏΠΎΠΆΠΈΠ²Π°ΡΠΈ Π²ΡΠ΅: ΠΏΡΠΈΡΠΊΠΎΡΠ΅Π½Π΅ Π½Π°Π³ΡΡΠ²Π°Π½Π½Ρ ΠΊΠ»ΡΠΌΠ°ΡΡ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡ ΠΏΠ°Π»ΠΈΠ²Ρ. Π£ΠΊΡΠ°ΡΠ½ΡΡΠΊΠ° ΡΠ΅ΠΊΡΡΡ EUROSOLAR ΡΠ° ΡΡ ΠΌΠ΅ΡΠ΅ΠΆΡ Π²ΠΆΠ΅ Π΄Π°Π²Π½ΠΎ Π²ΠΈΡΡΡΠΏΠ°ΡΡΡ Π·Π° Π½ΠΎΠ²Ρ Π΅ΠΏΠΎΡ Ρ Π²ΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½ΠΎΡ Π΅Π½Π΅ΡΠ³Π΅ΡΠΈΠΊΠΈ Ρ Π‘Ρ ΡΠ΄Π½ΡΠΉ ΠΠ²ΡΠΎΠΏΡ. 
Π Π°Π·ΠΎΠΌ Π· ΡΡΡΠΌΠ° ΡΠ½ΡΠΈΠΌΠΈ Π½Π°ΡΠΈΠΌΠΈ ΡΠ΅ΠΊΡΡΡΠΌΠΈ ΡΠ° ΡΠ»Π΅Π½Π°ΠΌΠΈ Π½Π° ΡΠ²ΡΠΎΠΏΠ΅ΠΉΡΡΠΊΠΎΠΌΡ ΠΊΠΎΠ½ΡΠΈΠ½Π΅Π½ΡΡ, Π²ΡΠ΄ Π ΠΎΡΡΡ Π΄ΠΎ ΠΡΠ΄Π΅ΡΠ»Π°Π½Π΄ΡΠ², Π° ΡΠ°ΠΊΠΎΠΆ Π²ΡΠ΄ Π’ΡΡΠ΅ΡΡΠΈΠ½ΠΈ Π΄ΠΎ ΠΠ°Π½ΡΡ, EUROSOLAR ΠΏΡΠΎΠΏΠΎΠ½ΡΡ ΡΡ ΠΊΠ»ΡΠΌΠ°ΡΠΈΡΠ½Ρ ΠΌΠΈΡΠ½Ρ ΠΏΠ»Π°ΡΡΠΎΡΠΌΡ. ΠΡΠΎΡ. ΠΡΡΠ΅Ρ ΠΡΠΎΡΠ΄ΠΆ, ΠΡΠ΅Π·ΠΈΠ΄Π΅Π½Ρ EUROSOLAR: β ΠΠ°ΡΡΠ°Π² ΡΠ°Ρ Π΄Π»Ρ ΠΊΠ»ΡΠΌΠ°ΡΠΈΡΠ½ΠΎΡ ΠΌΠΈΡΠ½ΠΎΡ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΡΡ, ΡΠΎΠ± ΠΏΡΠΎΡΠΈΡΡΠΎΡΡΠΈ ΡΠΏΡΠ»ΡΠ½ΠΎΠΌΡ Π²ΠΎΡΠΎΠ³Ρ ΠΊΠΎΠΆΠ½ΠΎΠ³ΠΎ: ΠΏΠ΅ΡΠ΅Π΄ΠΎΠ²ΡΠΉ Π΄Π΅ΡΡΠ°Π±ΡΠ»ΡΠ·Π°ΡΡΡ Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΊΠ»ΡΠΌΠ°ΡΡ. Π¦Π΅ ΠΎΠ΄Π½Π° Π· Π΄Π΅ΡΡΡΠΈ Π΄ΡΠΉ, ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½ΠΈΡ EUROSOLAR ΡΠΊ ΠΎΡΠ½ΠΎΠ²Π½ΠΈΠΉ ΠΏΠΎΡΡΠ΄ΠΎΠΊ Π΄Π΅Π½Π½ΠΈΠΉ Π½Π°ΡΠΎΠ³ΠΎ ΡΠ°ΡΡ. β Π ΠΌΠΎΠΌΠ΅Π½ΡΡ ΡΠ²ΠΎΠ³ΠΎ Π·Π°ΡΠ½ΡΠ²Π°Π½Π½Ρ Π² 1988 ΡΠΎΡΡ EUROSOLAR ΠΏΡΠ°ΡΡΠ²Π°Π² Π½Π°Π΄ ΠΏΡΠΈΠΏΠΈΠ½Π΅Π½Π½ΡΠΌ Π²ΡΠΉΠ½ΠΈ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡ ΠΏΠ°Π»ΠΈΠ²Ρ ΡΠ»ΡΡ ΠΎΠΌ Π²Π΅Π»ΠΈΠΊΠΎΠ³ΠΎ ΠΏΠ΅ΡΠ΅Ρ ΠΎΠ΄Ρ Π½Π° 100% Π²ΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½Ρ Π΅Π½Π΅ΡΠ³ΡΡ. ΠΠ° ΡΠ»ΠΎΠ²Π°ΠΌΠΈ ΠΠ΅ΡΠΌΠ°Π½Π° Π¨ΠΈΡΠ° ( 1944-2010), Π·Π°ΡΠ½ΠΎΠ²Π½ΠΈΠΊΠ° EUROSOLAR: Β« ΠΡΠ΄Π½ΠΎΠ²Π»ΡΠ²Π°Π½Ρ Π΄ΠΆΠ΅ΡΠ΅Π»Π° Π΅Π½Π΅ΡΠ³ΡΡ ΡΡΠ²ΠΎΡΡΡΡΡ ΠΌΠΈΡ Β». ΠΠΏΠΎΡ Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎ-ΡΠ΄Π΅ΡΠ½ΠΈΡ Π·Π°Π³ΡΠΎΠ· ΠΏΠΎΠ²ΠΈΠ½Π½Π° Π·Π°ΠΊΡΠ½ΡΠΈΡΠΈΡΡ, ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ ΠΏΠΎΡΠ°ΡΠΈΡΡ Π΅ΠΊΠ·ΠΈΡΡΠ΅Π½ΡΡΠ°Π»ΡΠ½ΠΈΠΉ ΡΠΎΠΊΡΡ: www.earthdecade.org. EUROSOLAR ΡΠ°ΠΊΠΎΠΆ Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ Π΄ΠΎ Π·ΠΌΡΠ½ΠΈ ΠΌΠΈΡΠ»Π΅Π½Π½Ρ Π² Π±ΡΠΊ ΠΊΠ»ΡΠΌΠ°ΡΠΈΡΠ½ΠΎΡ ΠΌΠΈΡΠ½ΠΎΡ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΡΡ, ΡΠΊΠ° Π²ΠΈΠ·Π½Π°Ρ Ρ Π±ΠΎΡΠ΅ΡΡΡΡ Π· Π²ΠΈΠΊΠΎΠΏΠ½ΠΈΠΌΠΈ Π·Π°Π»Π΅ΠΆΠ½ΠΎΡΡΡΠΌΠΈ ΡΠΊ Π½Π°ΠΉΠ±ΡΠ»ΡΡΠΈΠΉ ΡΠΏΡΠ»ΡΠ½ΠΈΠΉ Π²ΠΎΡΠΎΠ³ Π»ΡΠ΄ΡΡΠ²Π°. 
https: //www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom cy/ ΠΡΠΊΡΡΡΠΎΠ΅ ΠΏΠΈΡΡΠΌΠΎ ΠΏΡΠ΅Π·ΠΈΠ΄Π΅Π½ΡΠ°ΠΌ ΠΡΡΠΈΠ½Ρ, ΠΠ°ΠΉΠ΄Π΅Π½Ρ, ΠΠ΅Π»Π΅Π½ΡΠΊΠΎΠΌΡ ΠΈ ΠΡΠΊΠ°ΡΠ΅Π½ΠΊΠΎ: EUROSOLAR, ΠΠ²ΡΠΎΠΏΠ΅ΠΉΡΠΊΠ°Ρ Π°ΡΡΠΎΡΠΈΠ°ΡΠΈΡ Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΠΎΠΉ ΡΠ½Π΅ΡΠ³Π΅ΡΠΈΠΊΠΈ, ΠΏΡΠΈΠ·ΡΠ²Π°Π΅Ρ ΠΊ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎΠΌΡ ΠΏΡΠ΅ΠΊΡΠ°ΡΠ΅Π½ΠΈΡ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠ³ΠΎ ΠΎΠ³Π½Ρ ΠΈ Π·Π°ΠΊΠ»ΡΡΠ΅Π½ΠΈΡ ΠΏΠΎΡΡΠΎΡΠ½Π½ΠΎΠ³ΠΎ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠ³ΠΎ ΠΌΠΈΡΠ½ΠΎΠ³ΠΎ ΡΠΎΠ³Π»Π°ΡΠ΅Π½ΠΈΡ ΠΏΠΎ Π²ΡΠ΅ΠΉ ΠΠΎΡΡΠΎΡΠ½ΠΎΠΉ ΠΠ²ΡΠΎΠΏΠ΅ β ΠΈ, ΡΠ°ΠΊΠΈΠΌ ΠΎΠ±ΡΠ°Π·ΠΎΠΌ, ΠΊ Π½Π°ΡΠ°Π»Ρ ΠΌΠ½ΠΎΠ³ΠΎΡΡΠΎΡΠΎΠ½Π½Π΅ΠΉ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠΉ ΠΌΠΈΡΠ½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΠΈΠΈ. ΠΠ°ΠΏΠ°Π΄Π΅Π½ΠΈΠ΅ ΡΠΎΡΡΠΈΠΉΡΠΊΠΈΡ Π²ΠΎΠ΅Π½Π½ΡΡ Π½Π° ΡΠΊΡΠ°ΠΈΠ½ΡΠΊΠΈΠΉ Π½Π°ΡΠΎΠ΄ ΠΈ Π΅Π³ΠΎ ΠΏΡΠ°Π²ΠΈΡΠ΅Π»ΡΡΡΠ²ΠΎ Π΄ΠΎΠ»ΠΆΠ½ΠΎ Π±ΡΡΡ ΠΎΡΡΠΆΠ΄Π΅Π½ΠΎ ΡΠ°ΠΌΡΠΌ ΡΠ΅ΡΠΈΡΠ΅Π»ΡΠ½ΡΠΌ ΠΎΠ±ΡΠ°Π·ΠΎΠΌ ΠΈ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎ ΠΎΡΡΠ°Π½ΠΎΠ²Π»Π΅Π½ΠΎ. ΠΡΠ΅ ΡΡΡΠ°Π½Ρ, ΠΊΠΎΡΠΎΡΡΠ΅ ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΡΡ Π²ΠΎΠ΅Π½Π½ΡΠ΅ ΡΠΎΡΠ·Ρ Π΄Π»Ρ ΠΏΠΎΡΡΠΎΡΠ½Π½ΠΎΠΉ ΠΊΠΎΡΡΠ΅ΠΊΡΠΈΡΠΎΠ²ΠΊΠΈ ΡΠ²ΠΎΠΈΡ ΡΡΠ΅Ρ ΠΈΠ½ΡΠ΅ΡΠ΅ΡΠΎΠ² ΠΈ ΠΏΠΎΡΡΠΎΡΠ½Π½ΠΎΠΉ Π±ΠΎΡΡΠ±Ρ Π·Π° ΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΠΎΠ΅ ΠΈ ΡΡΡΠ°ΡΠ΅Π³ΠΈΡΠ΅ΡΠΊΠΎΠ΅ ΠΏΡΠ΅ΠΈΠΌΡΡΠ΅ΡΡΠ²ΠΎ, Π΄ΠΎΠ»ΠΆΠ½Ρ ΠΏΡΠ΅ΠΊΡΠ°ΡΠΈΡΡ ΡΠ²ΠΎΡ Π΄Π΅ΡΡΠ°Π±ΠΈΠ»ΠΈΠ·ΠΈΡΡΡΡΡΡ ΠΏΡΠ°ΠΊΡΠΈΠΊΡ. ΠΡΠ΅ Π²ΠΎΠ²Π»Π΅ΡΠ΅Π½Π½ΡΠ΅ ΡΡΠΎΡΠΎΠ½Ρ Π΄ΠΎΠ»ΠΆΠ½Ρ ΠΏΡΠΎΡΠ½ΡΡΡΡΡ: ΠΠ°Π»ΠΎ ΡΠΎΠ³ΠΎ, ΡΡΠΎ ΠΌΡ Π²ΡΠ΅ ΡΠΌΠΎΡΡΠΈΠΌ Π² ΡΠ΄Π΅ΡΠ½ΡΡ Π±Π΅Π·Π΄Π½Ρ ΠΈΠ·-Π·Π° Π΄Π»ΠΈΡΠ΅Π»ΡΠ½ΡΡ Π½Π΅ΡΠ΄Π°ΡΠ½ΡΡ ΠΏΠΎΠΏΡΡΠΎΠΊ ΡΠ°Π·ΠΎΡΡΠΆΠ΅Π½ΠΈΡ β ΠΏΠ»Π°Π½Π΅ΡΠ° ΡΠ°ΠΊΠΆΠ΅ Π½Π°Ρ ΠΎΠ΄ΠΈΡΡΡ Π² Π½Π΅ΠΊΠΎΠ½ΡΡΠΎΠ»ΠΈΡΡΠ΅ΠΌΠΎΠΉ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠΉ ΡΠΏΠΈΡΠ°Π»ΠΈ, ΠΊΠΎΡΠΎΡΠ°Ρ ΠΏΠΎΡΡΠΈ Π½Π°Π²Π΅ΡΠ½ΡΠΊΠ° ΡΠ΄Π΅Π»Π°Π΅Ρ Π΅Π΅ Π½Π΅ΠΏΡΠΈΠ³ΠΎΠ΄Π½ΠΎΠΉ Π΄Π»Ρ ΠΆΠΈΠ·Π½ΠΈ ΡΠΆΠ΅ Π² ΡΡΠΎΠΌ ΠΏΠΎΠΊΠΎΠ»Π΅Π½ΠΈΠΈ. 
EUROSOLAR, ΠΠ²ΡΠΎΠΏΠ΅ΠΉΡΠΊΠ°Ρ Π°ΡΡΠΎΡΠΈΠ°ΡΠΈΡ Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΡΡ ΠΈΡΡΠΎΡΠ½ΠΈΠΊΠΎΠ² ΡΠ½Π΅ΡΠ³ΠΈΠΈ, ΠΏΡΠΈΠ·ΡΠ²Π°Π΅Ρ ΠΊ ΠΏΠΎΠ»Π½ΠΎΠΌΡ ΠΈ Π±ΡΡΡΡΠΎΠΌΡ ΠΏΠ΅ΡΠ΅Ρ ΠΎΠ΄Ρ Π½Π° Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΡΠ΅ ΠΈΡΡΠΎΡΠ½ΠΈΠΊΠΈ ΡΠ½Π΅ΡΠ³ΠΈΠΈ, ΡΡΠΎΠ±Ρ ΠΏΠΎΠ»ΠΎΠΆΠΈΡΡ ΠΊΠΎΠ½Π΅Ρ Π·Π°Π²ΠΈΡΠΈΠΌΠΎΡΡΠΈ ΠΠ²ΡΠΎΠΏΡ ΠΈ Π²ΡΠ΅Π³ΠΎ ΠΌΠΈΡΠ° ΠΎΡ ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΠΎΠ³ΠΎ ΡΠΎΠΏΠ»ΠΈΠ²Π°. ΠΠ½Π° ΠΏΡΠΈΠ²Π΅Π»Π° ΠΊ Π±Π΅ΡΠΊΠΎΠ½Π΅ΡΠ½ΡΠΌ Π²ΠΎΠΉΠ½Π°ΠΌ, Π½Π΅Π²ΡΡΠ°Π·ΠΈΠΌΡΠΌ ΡΡΡΠ°Π΄Π°Π½ΠΈΡΠΌ ΠΈ Π½Π΅ΡΡΠ°ΡΡΡΡΠΌ Π½Π° ΠΏΡΠΎΡΡΠΆΠ΅Π½ΠΈΠΈ Π±ΠΎΠ»Π΅Π΅ Π²Π΅ΠΊΠ°, ΠΎΡΠ²Π»Π΅ΠΊΠ°Ρ Π½Π°Ρ ΠΎΡ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΎΠ±ΡΠ΅Π³ΠΎ Π²ΡΠ°Π³Π°, ΠΊΠΎΡΠΎΡΡΠΉ ΡΠ³ΡΠΎΠΆΠ°Π΅Ρ ΠΏΠΎΠ³Π»ΠΎΡΠΈΡΡ Π²ΡΠ΅Ρ Π½Π°Ρ: ΡΡΠΊΠΎΡΠ΅Π½Π½ΠΎΠ³ΠΎ Π³Π»ΠΎΠ±Π°Π»ΡΠ½ΠΎΠ³ΠΎ ΠΏΠΎΡΠ΅ΠΏΠ»Π΅Π½ΠΈΡ, Π²ΡΠ·Π²Π°Π½Π½ΠΎΠ³ΠΎ ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΡΠΌ ΡΠΎΠΏΠ»ΠΈΠ²ΠΎΠΌ. Π£ΠΊΡΠ°ΠΈΠ½ΡΠΊΠ°Ρ ΡΠ΅ΠΊΡΠΈΡ EUROSOLAR ΠΈ Π΅Π΅ ΡΠ΅ΡΠΈ Π΄Π°Π²Π½ΠΎ Π²ΡΡΡΡΠΏΠ°ΡΡ Π·Π° Π½ΠΎΠ²ΡΡ ΡΡΡ Ρ Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΡΠΌΠΈ ΠΈΡΡΠΎΡΠ½ΠΈΠΊΠ°ΠΌΠΈ ΡΠ½Π΅ΡΠ³ΠΈΠΈ Π² ΠΠΎΡΡΠΎΡΠ½ΠΎΠΉ ΠΠ²ΡΠΎΠΏΠ΅. ΠΠΌΠ΅ΡΡΠ΅ ΡΠΎ Π²ΡΠ΅ΠΌΠΈ Π΄ΡΡΠ³ΠΈΠΌΠΈ Π½Π°ΡΠΈΠΌΠΈ ΡΠ΅ΠΊΡΠΈΡΠΌΠΈ ΠΈ ΡΠ»Π΅Π½Π°ΠΌΠΈ ΠΏΠΎ Π²ΡΠ΅ΠΌΡ Π΅Π²ΡΠΎΠΏΠ΅ΠΉΡΠΊΠΎΠΌΡ ΠΊΠΎΠ½ΡΠΈΠ½Π΅Π½ΡΡ, ΠΎΡ Π ΠΎΡΡΠΈΠΈ Π΄ΠΎ ΠΠΈΠ΄Π΅ΡΠ»Π°Π½Π΄ΠΎΠ² ΠΈ ΠΎΡ Π’ΡΡΡΠΈΠΈ Π΄ΠΎ ΠΠ°Π½ΠΈΠΈ, EUROSOLAR ΠΏΡΠ΅Π΄Π»Π°Π³Π°Π΅Ρ ΡΡΡ ΠΏΠ»Π°ΡΡΠΎΡΠΌΡ ΠΌΠΈΡΠ° ΠΊΠ»ΠΈΠΌΠ°ΡΡ. ΠΡΠΎΡΠ΅ΡΡΠΎΡ ΠΠ΅ΡΠ΅Ρ ΠΡΠΎΠ³Π΅, ΠΏΡΠ΅Π·ΠΈΠ΄Π΅Π½Ρ EUROSOLAR: β ΠΠ°ΡΡΠ°Π»ΠΎ Π²ΡΠ΅ΠΌΡ Π΄Π»Ρ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠΉ ΠΌΠΈΡΠ½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΠΈΠΈ, ΡΡΠΎΠ±Ρ ΠΏΡΠΎΡΠΈΠ²ΠΎΡΡΠΎΡΡΡ ΠΎΠ±ΡΠ΅ΠΌΡ Π΄Π»Ρ Π²ΡΠ΅Ρ Π²ΡΠ°Π³Ρ: Π΄Π΅ΡΡΠ°Π±ΠΈΠ»ΠΈΠ·Π°ΡΠΈΠΈ ΠΊΠ»ΠΈΠΌΠ°ΡΠ° Π·Π° ΡΡΠ΅Ρ ΠΏΠ΅ΡΠ΅Π΄ΠΎΠ²ΠΎΠ³ΠΎ ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΠΎΠ³ΠΎ ΡΠΎΠΏΠ»ΠΈΠ²Π°. ΠΡΠΎ ΠΎΠ΄Π½ΠΎ ΠΈΠ· Π΄Π΅ΡΡΡΠΈ Π΄Π΅ΠΉΡΡΠ²ΠΈΠΉ, ΠΊΠΎΡΠΎΡΡΠ΅ EUROSOLAR ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»ΡΠ΅Ρ ΠΊΠ°ΠΊ ΡΠ°ΠΌΡΡ Π²Π°ΠΆΠ½ΡΡ ΠΏΠΎΠ²Π΅ΡΡΠΊΡ Π΄Π½Ρ Π½Π°ΡΠ΅Π³ΠΎ Π²ΡΠ΅ΠΌΠ΅Π½ΠΈ β. ΠΠΎΠΊΡΠΎΡ ΠΡΠΈΠ³ΠΈΡΡΠ΅ Π¨ΠΌΠΈΠ΄Ρ, Π²ΠΈΡΠ΅-ΠΏΡΠ΅Π·ΠΈΠ΄Π΅Π½Ρ ΠΈ ΡΠ»Π΅Π½ ΠΏΡΠ°Π²Π»Π΅Π½ΠΈΡ EUROSOLAR ΠΠ΅ΡΠΌΠ°Π½ΠΈΡ: β ΠΠ°ΡΡΡΠΏΠΈΠ»ΠΎ Π²ΡΠ΅ΠΌΡ Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΠΎΠ³ΠΎ ΠΌΠΈΡΠ°, ΡΠ°ΡΡΡ Π½Π°ΡΠ΅ΠΉ ΠΏΡΠΎΠ³ΡΠ°ΠΌΠΌΡ β ΠΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΠΎΠ΅ Π΄Π΅ΡΡΡΠΈΠ»Π΅ΡΠΈΠ΅ β. 
ΠΠ½ Π²ΡΡΡΡΠΏΠ°Π΅Ρ Π·Π° ΠΏΠ΅ΡΠ΅ΠΎΡΠΌΡΡΠ»Π΅Π½ΠΈΠ΅ ΠΈ ΠΌΠΈΡΠ½ΡΠ΅ Π΄Π΅ΠΉΡΡΠ²ΠΈΡ Π²ΠΎ ΠΈΠΌΡ Π½Π°ΡΠ΅Π³ΠΎ ΠΎΠ±ΡΠ΅Π³ΠΎ Π±ΡΠ΄ΡΡΠ΅Π³ΠΎ. Π‘ ΠΌΠΎΠΌΠ΅Π½ΡΠ° ΡΠ²ΠΎΠ΅Π³ΠΎ ΠΎΡΠ½ΠΎΠ²Π°Π½ΠΈΡ Π² 1988 Π³ΠΎΠ΄Ρ ΠΊΠΎΠΌΠΏΠ°Π½ΠΈΡ EUROSOLAR ΡΠ°Π±ΠΎΡΠ°Π΅Ρ Π½Π°Π΄ ΡΠ΅ΠΌ, ΡΡΠΎΠ±Ρ ΠΏΠΎΠ»ΠΎΠΆΠΈΡΡ ΠΊΠΎΠ½Π΅Ρ Π²ΠΎΠΉΠ½Π°ΠΌ Π·Π° ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΠΎΠ΅ ΡΠΎΠΏΠ»ΠΈΠ²ΠΎ ΠΏΡΡΠ΅ΠΌ ΠΌΠ°ΡΡΡΠ°Π±Π½ΠΎΠ³ΠΎ ΠΏΠ΅ΡΠ΅Ρ ΠΎΠ΄Π° Π½Π° 100% Π²ΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΡΠ΅ ΠΈΡΡΠΎΡΠ½ΠΈΠΊΠΈ ΡΠ½Π΅ΡΠ³ΠΈΠΈ. ΠΠΎ ΡΠ»ΠΎΠ²Π°ΠΌ ΠΠ΅ΡΠΌΠ°Π½Π° Π¨Π΅Π΅ΡΠ° ( 1944-2010), ΠΎΡΠ½ΠΎΠ²Π°ΡΠ΅Π»Ρ EUROSOLAR: β ΠΠΎΠ·ΠΎΠ±Π½ΠΎΠ²Π»ΡΠ΅ΠΌΡΠ΅ ΠΈΡΡΠΎΡΠ½ΠΈΠΊΠΈ ΡΠ½Π΅ΡΠ³ΠΈΠΈ ΡΠΎΠ·Π΄Π°ΡΡ ΠΌΠΈΡ β. ΠΠ΅ΠΊ ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΠΎ-ΡΠ΄Π΅ΡΠ½ΡΡ ΡΠ³ΡΠΎΠ· Π΄ΠΎΠ»ΠΆΠ΅Π½ Π·Π°ΠΊΠΎΠ½ΡΠΈΡΡΡΡ, Π΄ΠΎΠ»ΠΆΠ½Π° Π½Π°ΡΠ°ΡΡΡΡ ΡΠΊΠ·ΠΈΡΡΠ΅Π½ΡΠΈΠ°Π»ΡΠ½Π°Ρ ΠΎΡΠΈΠ΅Π½ΡΠ°ΡΠΈΡ: https: //www.earthdecade.org. EUROSOLAR ΠΏΡΠΈΠ·ΡΠ²Π°Π΅Ρ ΠΊ ΠΏΠ΅ΡΠ΅ΠΎΡΠΌΡΡΠ»Π΅Π½ΠΈΡ Π² ΡΡΠΎΡΠΎΠ½Ρ ΠΊΠ»ΠΈΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠΉ ΠΌΠΈΡΠ½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°ΡΠΈΠΈ, ΠΊΠΎΡΠΎΡΠ°Ρ ΠΏΡΠΈΠ·Π½Π°Π΅Ρ ΠΈ Π±ΠΎΡΠ΅ΡΡΡ Ρ ΠΈΡΠΊΠΎΠΏΠ°Π΅ΠΌΠΎΠΉ Π·Π°Π²ΠΈΡΠΈΠΌΠΎΡΡΡΡ ΠΊΠ°ΠΊ Π²Π΅Π»ΠΈΡΠ°ΠΉΡΠΈΠΌ ΠΎΠ±ΡΠΈΠΌ Π²ΡΠ°Π³ΠΎΠΌ ΡΠ΅Π»ΠΎΠ²Π΅ΡΠ΅ΡΡΠ²Π°.https: //www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom cy/ Independent of political parties, institutions, companies and interest groups, EUROSOLAR has been developing and stimulating political and economic action drafts and concepts for the introduction of renewable energies since 1988. This ranges from market introduction strategies to proposals for further research and development policy, from tax policy subsidies to arms conversion with solar energy, from the contribution of solar energy for the Global South to agricultural, transport and construction policy. EuropΓ€ische Vereinigung fΓΌr Erneuerbare Energien e. V.
articles_df = articles_df[articles_df["lang"] == "en"]
Our exploration revealed a small number of articles containing non-English content (four in German and one mixed Russian/Ukrainian article detected as Russian). Since most LLMs and embedding models are primarily trained on English text, removing these articles ensures compatibility with our chosen models for this notebook. For simplicity, we'll only support English queries and responses within this RAG pipeline.
Introducing multilingual capabilities into a RAG pipeline presents an additional layer of complexity. Here's a breakdown of some key challenges:
- Embeddings: the embedding model must map semantically equivalent text in different languages close together, which requires a dedicated multilingual model.
- Retrieval: a query in one language must reliably retrieve relevant chunks written in another.
- Generation and evaluation: the LLM must answer in the user's language even when the retrieved context is not, and evaluation data would be needed per language.
Let us further analyze the contents of the articles. Before we do so, let us define what we mean by characters, words and tokens:
- Characters: individual symbols in the text, including letters, digits, punctuation and whitespace.
- Words: whitespace-separated units of text.
- Tokens: the units produced by a tokenizer (here spaCy's), which additionally splits off punctuation and contractions.
char_lens = articles_df["article"].map(len)
sns.histplot(char_lens, kde=True)
plt.title("Number of characters per article")
plt.xlabel("Number of characters")
plt.ylabel("Number of articles")
median_char_len = char_lens.median()
mean_char_len = char_lens.mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character count: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character count: {mean_char_len:.2f}")
plt.legend()
plt.show()
word_lens = articles_df["article"].map(lambda x: len(x.split()))
sns.histplot(word_lens, kde=True)
plt.title("Number of words per article")
plt.xlabel("Number of words")
plt.ylabel("Number of articles")
median_word_len = word_lens.median()
mean_word_len = word_lens.mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word count: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word count: {mean_word_len:.2f}")
plt.legend()
plt.show()
nlp = English()
tokenizer = nlp.tokenizer
# tokenizing is expensive, so compute the token counts only once
token_lens = articles_df["article"].map(lambda x: len(tokenizer(x)))
sns.histplot(token_lens, kde=True)
plt.title("Number of tokens per article")
plt.xlabel("Number of tokens")
plt.ylabel("Number of articles")
median_token_len = token_lens.median()
mean_token_len = token_lens.mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token count: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token count: {mean_token_len:.2f}")
plt.legend()
plt.show()
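The gap between word and token counts comes from punctuation and contractions, which a tokenizer splits off into separate tokens. A minimal illustration, using a regex as a stand-in for spaCy's tokenizer:

```python
import re

text = "Solar panels, batteries, and EV chargers: the full suite."
words = text.split()                       # whitespace-separated words
tokens = re.findall(r"\w+|[^\w\s]", text)  # words plus punctuation marks
print(len(words), len(tokens))  # 9 13
```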
all_tokens = [token.text for article in articles_df["article"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)
sns.barplot(
x=[count for token, count in alpha_token_counts.most_common(20)],
y=[token for token, count in alpha_token_counts.most_common(20)],
hue=[token for token, count in alpha_token_counts.most_common(20)]
)
plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
The initial approach returns common words which do not reflect the subject-specific nature of our document collection. We will remove them to understand the content of the texts better.
# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)
sns.barplot(
x=[count for token, count in non_stop_token_counts.most_common(20)],
y=[token for token, count in non_stop_token_counts.most_common(20)],
hue=[token for token, count in non_stop_token_counts.most_common(20)]
)
plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
As one would expect in a dataset of cleantech news articles, most of the tokens that are not punctuation or stopwords revolve around energy, climate, and technology, a good sign that the dataset is relevant to the topic at hand. The standalone "s" token comes up frequently; this stems from possessives whose apostrophes were corrupted during scraping (e.g. "Enphaseβ s"), leaving the trailing "s" as its own token. With an average of around 700 words per article, each article should carry a substantial amount of information, at an average reading time of around 3-4 minutes.
The Flesch Reading Ease Score (FRES, a.k.a Flesch-Kincaid Reading Ease Score) is a heuristic used to evaluate how easy it is to understand a text based on the length of sentences and the number of syllables per word. Scores typically range from 0 (very difficult to read) to 100 (very easy to read), though the formula is unbounded and can produce values outside this range. Scores below 50 indicate texts that are difficult even for college-level readers. This metric can be useful for assessing the readability of our articles and ensuring they are accessible to a broad audience.
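Under the hood, the score is a fixed linear formula over average sentence length and syllable density. A minimal reimplementation for illustration (the flesch_reading_ease function used below is assumed to come from the textstat package):

```python
def flesch_score(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease formula."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# 100 words across 5 sentences with 130 syllables: moderately easy prose
print(round(flesch_score(100, 5, 130), 3))  # 76.555
```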
articles_df["readability"] = articles_df["article"].apply(flesch_reading_ease)
sns.histplot(articles_df["readability"], kde=True)
plt.title("Flesch Reading Ease of articles")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of articles")
mean_readability = articles_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
We now analyze how the complexity of language varies across the different publishing domains.
domains = articles_df["domain"].unique()
# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row
plot_height = 6
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten() # Flatten the axes array for easier iteration
# Plot for each domain
for i, domain in enumerate(domains):
domain_articles = articles_df[articles_df["domain"] == domain]
sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
axes[i].set_title(f'Readability of {domain}')
axes[i].set_xlabel('Flesch Reading Ease Score')
axes[i].set_ylabel("Number of articles")
mean_readability = domain_articles["readability"].mean()
axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
# remove the empty plots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
To gauge the readability of our articles, we calculated the Flesch Reading Ease Score. The average score of around 45 sits just below the "difficult" threshold, corresponding roughly to college-level reading, which is unsurprising for technical news coverage. This level of complexity is dense for casual readers but poses no problem for the LLM and embedding models in our RAG pipeline.
Our analysis revealed a consistent average Flesch Reading Ease Score across most of the identified domains, with minor variations. This indicates a relatively consistent level of readability across different publishers within the dataset.
Finally we will save the cleaned dataset to a new file in the data/silver folder.
silver_folder = data_folder / "silver"
silver_folder.mkdir(parents=True, exist_ok=True)
articles_df.to_csv(silver_folder / "articles.csv", index=False)
Next we will analyze the provided evaluation data and ensure that it matches the content of the articles.
human_eval_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 1 to 23
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question_id     23 non-null     int64 
 1   question        23 non-null     object
 2   relevant_chunk  23 non-null     object
 3   article_url     23 non-null     object
dtypes: int64(1), object(3)
memory usage: 920.0+ bytes
human_eval_df.rename(columns={"relevant_chunk":"relevant_section","article_url": "url"}, inplace=True)
human_eval_df.drop(columns=["question_id"], inplace=True)
human_eval_df.head()
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 3 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | https://www.pv-magazine.com/2023/02/02/europea... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | https://cleantechnica.com/2023/05/08/general-m... |
sns.histplot(human_eval_df["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
missing_articles = human_eval_df[~human_eval_df["url"].isin(articles_df["url"])].copy()
missing_articles
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
Our exploration has identified instances where articles linked to specific questions appear to be missing from the dataset. To determine the root cause, let's investigate whether these articles are genuinely absent or if inconsistencies in URL formatting are creating the illusion of missing data. Normalizing the URLs across the dataset will help us differentiate between these two scenarios.
def normalize_url(url: str) -> str:
url = url.replace("https://", "")
url = url.replace("http://", "")
url = url.replace("www.", "")
url = url.rstrip("/")
return url
articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df["url"] = human_eval_df["url"].map(normalize_url)
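Simple string replacement works here, but it would also rewrite a "www." occurring mid-URL. A stricter variant using urllib.parse (a hypothetical normalize_url_strict, preserving query strings such as the azocleantech ?newsID=... URLs):

```python
from urllib.parse import urlparse

def normalize_url_strict(url: str) -> str:
    # Parse the URL so only the scheme and a leading "www." are stripped,
    # leaving the path and query string untouched.
    parsed = urlparse(url)
    host = parsed.netloc.removeprefix("www.")
    path = parsed.path.rstrip("/")
    query = f"?{parsed.query}" if parsed.query else ""
    return host + path + query

print(normalize_url_strict("https://www.azocleantech.com/news.aspx?newsID=32873"))
# azocleantech.com/news.aspx?newsID=32873
```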
missing_articles = human_eval_df[~human_eval_df["url"].isin(articles_df["url"])].copy()
missing_articles
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | sgvoice.net/strategy/technology/23971/leclanch... |
| 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.net/policy/25396/eu-seeks-competitive-... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.net/policy/25396/eu-seeks-competitive-... |
We also know from the earlier duplicate analysis that articles from "sgvoice.net" are mirrored on the "energyvoice" domain ("sgvoice.energyvoice.com"), so we will normalize these URLs as well.
missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
| question | relevant_section | url | |
|---|---|---|---|
| example_id |
human_eval_df.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df[human_eval_df["url"].isin(articles_df["url"])]
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... |
| 2 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 3 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... |
| 6 | Did Colgate-Palmolive enter into PPA agreement... | Scout Clean Energy, a Colorado-based renewable... | solarindustrymag.com/scout-and-colgate-palmoli... |
| 7 | What is the status of ZeroAvia's hydrogen fuel... | In December, the US startup ZeroAvia announced... | cleantechnica.com/2023/01/02/the-wait-for-hydr... |
| 8 | What is the "Danger Season"? | As spring turns to summer and the days warm up... | cleantechnica.com/2023/05/15/what-does-a-norma... |
| 9 | Is Mississipi an anti-ESG state? | Mississippi is among two dozen or so states in... | cleantechnica.com/2023/05/15/mississippi-takes... |
| 10 | Can you hang solar panels on garden fences? | Scaling down from the farm to the garden level... | cleantechnica.com/2023/05/18/solar-panels-for-... |
| 11 | Who develops quality control systems for ocean... | Scientists from the Chinese Academy of Science... | azocleantech.com/news.aspx?newsID=32873 |
| 12 | Why are milder winters detrimental for grapes ... | Since grapes and apples are perennial species,... | azocleantech.com/news.aspx?newsID=33040 |
| 13 | What are the basic recycling steps for solar p... | There are some simple recycling steps that can... | azocleantech.com/news.aspx?newsID=33143 |
| 14 | Why does melting ice contribute to global warm... | Whereas white ice reflects the sun's rays, a d... | azocleantech.com/news.aspx?newsID=33149 |
| 15 | Does the Swedish government plan bans on new p... | The Swedish government has proposed a ban on n... | azocleantech.com/news.aspx?newsID=33174 |
| 16 | Where do the turbines used in Icelandic geothe... | Minister Nishimura mentioned that most geother... | thinkgeoenergy.com/japan-and-iceland-agree-on-... |
| 17 | Who is the target user for Leapfrog Energy? | O'Brien added, "Subsurface specialists need fl... | thinkgeoenergy.com/seequent-expands-subsurface... |
| 18 | What is Agrivoltaics? | Agrivoltaics, the integration of food producti... | pv-magazine.com/2023/03/31/new-software-modeli... |
| 19 | What is Agrivoltaics? | Agrivoltaics refers to the conduct of agricult... | cleantechnica.com/2022/12/18/agrivoltaics-goes... |
| 20 | Why is cannabis cultivation moving indoors? | Cannabis cultivation can take place outdoors, ... | pv-magazine.com/2023/04/08/high-time-for-solar... |
| 21 | What are the obstacles for cannabis producers ... | "There are a lot of prevailing headwinds for c... | pv-magazine.com/2023/04/08/high-time-for-solar... |
| 22 | In 2021, what were the top 3 states in the US ... | In 2021, Florida surpassed North Carolina to b... | cleantechnica.com/2023/04/10/solar-power-in-fl... |
| 23 | Which has the higher absorption coefficient fo... | We chose amorphous germanium instead of amorph... | pv-magazine.com/2021/01/15/germanium-based-sol... |
In the end we are able to find all the articles that are linked to the evaluation data and have therefore successfully completed our exploratory data analysis and preprocessing.
For faster processing and to reduce the cost of running the notebook, we will subsample the dataset to 1,000 articles. This keeps the runtime reasonable while still yielding meaningful results. Because the distribution of articles across publishers is skewed, we use stratified sampling to ensure a representative sample. We also need to keep in mind that the evaluation data is linked to specific articles, so those articles must be included in the subsample.
eval_articles_df = articles_df[articles_df["url"].isin(human_eval_df["url"])]
eval_articles_df.head()
| title | content | domain | url | article | lang | readability | |
|---|---|---|---|---|---|---|---|
| 6780 | Leclanché's new disruptive battery boosts ene... | ['Energy storage company Leclanché ( SW.LECN) ... | energyvoice | sgvoice.energyvoice.com/strategy/technology/23... | Energy storage company Leclanché ( SW.LECN) ha... | en | 43.22 |
| 6805 | EU seeks competitive boost with Green Deal Ind... | ['The EU has presented its "Green Deal Indust... | energyvoice | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The EU has presented its "Green Deal Industri... | en | 34.70 |
| 16367 | Agrivoltaics Goes Nuclear On California Prairie | ['A decommissioned nuclear power plant from th... | cleantechnica | cleantechnica.com/2022/12/18/agrivoltaics-goes... | A decommissioned nuclear power plant from the ... | en | 42.00 |
| 16402 | The Wait For Hydrogen Fuel Cell Electric Aircr... | ['The US firm ZeroAvia is one step closer to b... | cleantechnica | cleantechnica.com/2023/01/02/the-wait-for-hydr... | The US firm ZeroAvia is one step closer to bri... | en | 50.46 |
| 16725 | Solar Power In Florida | ['Many renewable energy endeavors in Florida a... | cleantechnica | cleantechnica.com/2023/04/10/solar-power-in-fl... | Many renewable energy endeavors in Florida are... | en | 44.75 |
print(eval_articles_df["url"].unique().shape)
print(human_eval_df["url"].unique().shape)
(21,) (21,)
def do_stratification(
    df: pd.DataFrame,
    column: str,
    sample_size: int,
    seed: int = 42
) -> pd.DataFrame:
    # sample from each group proportionally to its share of the full dataset
    sample_frac = sample_size / len(df)
    indx = (
        df.groupby(column, group_keys=False)[column]
        .apply(lambda x: x.sample(n=int(sample_frac * len(x)), random_state=seed))
        .index.to_list()
    )
    return df.loc[indx]
sample_df = do_stratification(articles_df, "domain", 1000, 69)
# drop evaluation articles already present in the subsample so the concatenated result contains only unique urls
sample_df = sample_df[~sample_df["url"].isin(eval_articles_df["url"])]
sample_df = pd.concat([sample_df, eval_articles_df])
sample_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 1011 entries, 38325 to 81779 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 1011 non-null object 1 content 1011 non-null object 2 domain 1011 non-null object 3 url 1011 non-null object 4 article 1011 non-null object 5 lang 1011 non-null object 6 readability 1011 non-null float64 dtypes: float64(1), object(6) memory usage: 63.2+ KB
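To see concretely why this stratification preserves the publisher mix, here is a standalone toy check (the helper is re-declared so the cell runs on its own, and the domain shares below are made up for illustration):

```python
import pandas as pd

# Re-declared copy of do_stratification so this check runs standalone.
def do_stratification(df, column, sample_size, seed=42):
    indx = (
        df.groupby(column, group_keys=False)[column]
        .apply(lambda x: x.sample(n=int(sample_size / len(df) * len(x)), random_state=seed))
        .index.to_list()
    )
    return df.loc[indx]

# Toy frame: domain "a" makes up 60%, "b" 30% and "c" 10% of 1000 rows.
toy = pd.DataFrame({"domain": ["a"] * 600 + ["b"] * 300 + ["c"] * 100})
sample = do_stratification(toy, "domain", 100)
print(sample["domain"].value_counts())
# the 60/30/10 split survives exactly: a=60, b=30, c=10
```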
To make sure that the distributional characteristics have not been changed by subsampling, we visualize and compare both datasets in relative terms.
original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"
sample_domain_counts = sample_df["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"
domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
x=domain_counts_df.index,
y=domain_counts_df["count"],
hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()
Now all is prepared to start developing our RAG!
Chunking is a crucial step in the RAG pipeline. It involves breaking down the articles into smaller, more manageable pieces.

There are mainly two reasons for this: embedding models can only represent a limited amount of text in a single vector, and smaller, focused chunks let the retriever return passages that closely match a query instead of whole articles.
Let's start by getting a feel for how much text the common chunk sizes correspond to, measured in characters.
def get_lorem_text(num_chars: int) -> str:
expected_avg_word_len = 3 # on the lower side to be safe
text = lorem.words(num_chars // expected_avg_word_len)
return text[:num_chars]
print(wrap_text(get_lorem_text(256)))
repellendus nobis veritatis voluptatem fugit vero odit tenetur ipsam culpa ab officia quas rerum nihil nemo veniam iure eveniet nesciunt quidem error impedit officiis neque enim consequatur fugiat illum fuga voluptatibus magni dolor tempore maxime nostrum
print(wrap_text(get_lorem_text(512)))
debitis tenetur ipsa impedit quod facilis ipsam deserunt quia iste eum quasi alias provident ducimus numquam aliquid maxime similique veritatis iure tempora doloribus facere inventore fuga quos omnis necessitatibus soluta expedita maiores dolores incidunt nihil rem laboriosam sunt vel totam itaque voluptates exercitationem sequi dolorum molestiae sapiente architecto ad ullam commodi iusto corporis eligendi velit perferendis laborum dicta odit dolor cumque accusamus ea distinctio nisi consectetur et quidem p
print(wrap_text(get_lorem_text(1024)))
earum maiores quibusdam nam reprehenderit eum voluptatibus mollitia nisi magni quas autem optio molestias natus expedita totam eius quia atque quod sit ad iste qui ullam corrupti in ipsum accusantium hic eos illo rerum voluptatem fugiat iure assumenda distinctio nobis consequuntur itaque ea possimus molestiae amet fuga animi dolores temporibus dolore tempore explicabo corporis nesciunt consectetur sequi quisquam illum minima odit omnis reiciendis repellat repudiandae blanditiis minus non necessitatibus sint obcaecati aliquam ex perspiciatis voluptate culpa unde provident doloribus vel sed suscipit repellendus officiis quaerat libero laborum et quae architecto ut exercitationem soluta vero aut enim laudantium voluptatum accusamus nulla praesentium deserunt id asperiores ipsam similique facere aliquid tempora eligendi ratione sapiente neque cumque dolorem rem delectus dolorum impedit incidunt adipisci esse eveniet ipsa modi perferendis commodi dolor officia magnam doloremque pariatur velit facilis inventore nos
print(wrap_text(get_lorem_text(2048)))
repellat laborum voluptates sint facilis eaque fuga corporis unde labore quia illo id rem at maxime iste quae quos aliquid provident atque consectetur doloremque eligendi non dolore quod pariatur ab rerum quas molestias corrupti sequi blanditiis deserunt qui mollitia temporibus modi sunt harum consequatur asperiores necessitatibus reprehenderit perspiciatis dicta eveniet ad voluptatum totam nesciunt amet nihil voluptate alias facere ut ducimus excepturi aperiam nobis beatae aliquam omnis laudantium cupiditate soluta cum quisquam iusto accusantium exercitationem autem illum neque optio nisi sit fugiat iure recusandae minima earum natus enim aut debitis odit doloribus voluptas magni tempore veritatis voluptatibus commodi veniam molestiae libero et magnam vero eos esse nam fugit voluptatem ipsa porro officia inventore quidem dolores tenetur dolor architecto quo dolorem placeat ipsam minus sapiente ratione ipsum dolorum quam ex quasi ullam nostrum hic delectus in consequuntur numquam laboriosam reiciendis culpa est explicabo ea possimus a saepe nulla tempora maiores dignissimos obcaecati perferendis eius incidunt quibusdam repudiandae suscipit nemo impedit adipisci similique distinctio sed animi officiis velit quaerat odio accusamus cumque assumenda vitae deleniti expedita praesentium vel aspernatur eum error itaque quis repellendus culpa earum eveniet libero cupiditate ea dolorem officia mollitia vitae consequatur veniam repellat delectus illum sapiente ut ex eaque neque inventore consectetur natus officiis quibusdam modi fuga sunt id dolore animi similique reprehenderit nulla magni vel iusto odit dolor architecto nostrum tempore sit perferendis laboriosam corrupti tempora alias dolores iste dolorum cumque enim facere qui tenetur quasi quia autem iure minus obcaecati distinctio soluta et assumenda nam provident possimus blanditiis saepe rerum adipisci debitis accusantium minima nesciunt quam deserunt eius magnam omnis error doloremque quos voluptas consequuntur nobis 
laudantium amet voluptates quo ipsum aut nemo rat
In this notebook we will be using two different chunking strategies:
To see how different texts get chunked with different strategies and chunk sizes check out the Chunking Visualizer.
def get_recursive_splitter(chunk_size: int, chunk_overlap: int) -> TextSplitter:
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
        # the lookbehind pattern splits after sentence-ending periods,
        # so the separators must be treated as regular expressions
        is_separator_regex=True,
        length_function=len,
    )
# the recursive splitter mainly relies on newlines, are there even any? No, so it will focus on sentences.
sample_df["article"].map(lambda x: x.count("\n")).sum()
0
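Since the articles contain no newlines, the splitter will mostly fall back to sentence and word boundaries. The core idea behind `chunk_size` and `chunk_overlap` can be illustrated with a deliberately naive sliding-window sketch (this is not LangChain's actual algorithm, which recurses through the separator list and respects text boundaries):

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Naive fixed-stride chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share an overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # last chunk already reaches the end
            break
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# -> ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Note how the first four characters of each chunk repeat the last four of its predecessor; that shared context is what the overlap buys at retrieval time.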
Let us set the device for efficient use of available resources.
# if we can make use of any device that is better than the CPU, we will use it
device = "cpu"
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
model_kwargs = {'device': device, "trust_remote_code": True}
model_kwargs
{'device': 'cuda', 'trust_remote_code': True}
We select three embedding models from HuggingFace to represent our text fragments in numerical form in a vector space.
embedding_models = {
"mini": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs=model_kwargs),
"bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs=model_kwargs),
"gte": HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-base-en-v1.5", model_kwargs=model_kwargs),
}
We also define the chunking strategies to be used. The recursive splitting is characterized by the length of chunks and the overlap between adjacent chunks. For the semantic chunking, sentences embedded as dense vectors are merged as long as the cosine distance between two consecutive sentences does not exceed a percentile-based threshold.
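The percentile mechanism can be sketched in a few lines of numpy. The "sentence embeddings" below are hand-made 2-D vectors, not real model outputs; a breakpoint is placed before every sentence whose cosine distance to its predecessor exceeds the chosen percentile:

```python
import numpy as np

# Hand-made "sentence embeddings": s0/s1 share a topic, s2-s4 share another.
emb = np.array([
    [1.0, 0.0],  # s0
    [0.9, 0.1],  # s1 - close to s0
    [0.0, 1.0],  # s2 - topic shift
    [0.1, 0.9],  # s3 - close to s2
    [0.0, 1.0],  # s4
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# cosine distance between each pair of consecutive sentences
dists = 1.0 - np.sum(emb[:-1] * emb[1:], axis=1)
threshold = np.percentile(dists, 75)

# a chunk boundary goes before every sentence whose distance to its
# predecessor exceeds the percentile threshold
breakpoints = [i + 1 for i, d in enumerate(dists) if d > threshold]
print(breakpoints)
# -> [2]: split before s2, giving chunks [s0, s1] and [s2, s3, s4]
```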
recursive_256_splitter = get_recursive_splitter(256, 64)
recursive_1024_splitter = get_recursive_splitter(1024, 128)
semantic_splitter = SemanticChunker(
embedding_models["gte"], breakpoint_threshold_type="percentile"
)
splitters = {
"recursive_256": recursive_256_splitter,
"recursive_1024": recursive_1024_splitter,
"semantic": semantic_splitter
}
def chunk_documents(df: pd.DataFrame, text_splitter: TextSplitter):
chunks = []
id = 0
for _, row in tqdm(df.iterrows(), total=len(df)):
article_content = row['article']
title = row['title']
# we add the title to the content as it might be relevant to the question
full_text = title + ": " + article_content
char_chunks = text_splitter.split_text(full_text)
for chunk in char_chunks:
id += 1
# add metadata to the chunk for potential later use
metadata = {
'title': row['title'],
'url': row['url'],
'domain': row['domain'],
'id': id,
}
chunks.append(Document(
page_content=chunk,
metadata=metadata,
))
return chunks
chunks_folder = silver_folder / "chunks"
if not chunks_folder.exists():
chunks_folder.mkdir()
The following function loads existing chunks when a checkpoint file is available (prepared ahead of time to speed up the tutorial) and otherwise creates and saves them.
def get_or_create_chunks(df: pd.DataFrame, text_splitter: TextSplitter, splitter_name: str) -> List[Document]:
chunks_file = chunks_folder / f"{splitter_name}_chunks.json"
if chunks_file.exists():
with open(chunks_file, "r") as file:
chunks = [Document(**chunk) for chunk in json.load(file)]
print(f"Loaded {len(chunks)} chunks from {chunks_file}")
else:
chunks = chunk_documents(df, text_splitter)
with open(chunks_file, "w") as file:
json.dump([doc.dict() for doc in chunks], file, indent=4)
print(f"Saved {len(chunks)} chunks to {chunks_file}")
return chunks
chunks = {}
for splitter_name, splitter in splitters.items():
chunks[splitter_name] = get_or_create_chunks(sample_df, splitter, splitter_name)
Loaded 25399 chunks from data\silver\chunks\recursive_256_chunks.json Loaded 5754 chunks from data\silver\chunks\recursive_1024_chunks.json Loaded 3144 chunks from data\silver\chunks\semantic_chunks.json
Now that we have created and saved the chunks we can analyze them. We can already see above that the semantic chunks are generally larger than the recursive chunks.
Let's start by looking at the first chunk of the first article to get a feeling for what the chunks look like depending on the chunking strategy and then we will look at the distribution of the chunk sizes and the number of chunks per article.
for splitter_name, splitter_chunks in chunks.items():
print(f"{splitter_name} chunks:")
print(wrap_text(splitter_chunks[0].page_content, char_per_line=150))
print()
recursive_256 chunks:
Satellite Vu: Quotes, Address, Contact: Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable

recursive_1024 chunks:
Satellite Vu: Quotes, Address, Contact: Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable insights into economic activity, energy efficiency and carbon footprint. This will enable better business decisions. Bad decisions are being made all over the world. These decisions are having a global impact! Satellite Vu will change these decisions for good. The Sensi+™ is a laser-based analyzer used for monitoring natural gas quality. The Cypher ES AFM from Oxford Instruments Asylum Research can be utilized for exceptional environmental control. The Vocus CI-TOF from TOFWERK provides real-time chemical ionization measurements. In this interview, AZoCleantech speaks with Tebogo Maleka, National Project Coordinator at the United Nations Industrial Development Organization ( UNIDO), about her role within the organization and the initiative that aims to support

semantic chunks:
Satellite Vu: Quotes, Address, Contact: Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable insights into economic activity, energy efficiency and carbon footprint. This will enable better business decisions.
def plot_chunk_lengths(chunks: List[Document], title: str):
sns.histplot([len(chunk.page_content) for chunk in chunks], kde=True)
plt.title(title)
plt.xlabel("Chunk length (characters)")
plt.ylabel("Number of chunks")
median_chunk_len = np.median([len(chunk.page_content) for chunk in chunks])
mean_chunk_len = np.mean([len(chunk.page_content) for chunk in chunks])
plt.axvline(median_chunk_len, color='r', linestyle='--', label=f"Median chunk length: {median_chunk_len:.2f}")
plt.axvline(mean_chunk_len, color='g', linestyle='--', label=f"Mean chunk length: {mean_chunk_len:.2f}")
plt.legend()
plt.show()
plot_chunk_lengths(chunks["recursive_256"], "Chunk lengths for recursive 256 splitter")
plot_chunk_lengths(chunks["recursive_1024"], "Chunk lengths for recursive 1024 splitter")
plot_chunk_lengths(chunks["semantic"], "Chunk lengths for semantic splitter")
chunks_per_article = {splitter_name: Counter([chunk.metadata["title"] for chunk in chunks]) for splitter_name, chunks in chunks.items()}
counts = {splitter_name: [count for title, count in chunk_counts.items()] for splitter_name, chunk_counts in chunks_per_article.items()}
sns.histplot(counts, kde=True)
plt.title("Number of chunks per article")
plt.xlabel("Number of chunks")
plt.ylabel("Number of articles")
plt.legend(chunks_per_article.keys())
plt.show()
From our analysis of the created chunks we can see that the recursive chunks all sit close to the configured maximum size, while the semantic chunks vary widely in length. This is because semantic chunk boundaries follow the semantic structure of each article rather than a fixed character budget.

We can also see that the distribution of the number of chunks per article is much wider for the recursive strategies: because every recursive chunk is roughly the same size, the chunk count scales directly with article length, whereas semantic chunking produces many small chunks and a few large ones, which compresses the per-article counts.
Now that we have clean chunks, the next step involves generating embeddings for our article chunks. These embeddings will serve as a crucial component for efficient retrieval within the RAG pipeline. For our vector store we'll utilize ChromaDB, a powerful tool for indexing and searching high-dimensional data. To integrate our chosen embedding models with ChromaDB, we'll define a custom wrapper class. This wrapper class will act as an intermediary, ensuring seamless communication between the models and the ChromaDB indexing system.
class CustomChromadbEmbeddingFunction(EmbeddingFunction):
    """Adapter exposing a LangChain embedding model through ChromaDB's EmbeddingFunction interface."""
    def __init__(self, model) -> None:
        super().__init__()
        self.model = model
    def _embed(self, texts):
        # embed each text individually with the wrapped model
        return [self.model.embed_query(text) for text in texts]
    def embed_query(self, query):
        return self._embed([query])
    def __call__(self, input: Documents) -> Embeddings:
        return self._embed(input)
We prepare three different embedding models in this tutorial.
chroma_embedding_functions = {
"mini": CustomChromadbEmbeddingFunction(embedding_models["mini"]),
"bge-m3": CustomChromadbEmbeddingFunction(embedding_models["bge-m3"]),
"gte": CustomChromadbEmbeddingFunction(embedding_models["gte"]),
}
for name, embedding_function in chroma_embedding_functions.items():
sample = embedding_function(["Hello, world!"])[0][:5]
print(f"{name} embedding sample: {sample}")
mini embedding sample: [0.03492265194654465, 0.01883007027208805, -0.017854733392596245, 0.00013882208440918475, 0.07407363504171371] bge-m3 embedding sample: [-0.016155648976564407, 0.02699340134859085, -0.042583219707012177, 0.013542206957936287, -0.01935463584959507] gte embedding sample: [0.03789425268769264, 0.346923828125, -0.20471259951591492, -0.2123868763446808, -0.49100878834724426]
Generating embeddings can be a computationally intensive process. To optimize efficiency and avoid redundant computations, we'll leverage checkpointing. This technique involves storing the generated embeddings along with their corresponding article chunks. We'll define a simple class to encapsulate this data, facilitating efficient retrieval and reducing the need for recalculating embeddings unless absolutely necessary.
embeddings_folder = silver_folder / "embeddings"
if not embeddings_folder.exists():
embeddings_folder.mkdir()
class DocumentEmbedding():
def __init__(self, document: Document, text_embedding: List[float]) -> None:
self.document = document
self.text_embedding = text_embedding
def to_dict(self) -> Dict:
return {
"document": self.document.dict(),
"text_embedding": self.text_embedding
}
@classmethod
def from_dict(cls, d: Dict) -> "DocumentEmbedding":
return cls(
document=Document(**d["document"]),
text_embedding=d["text_embedding"]
)
def get_or_create_embeddings(
embedding_function: EmbeddingFunction,
chunks: List[Document],
embedding_name: str,
) -> List[DocumentEmbedding]:
embeddings_file = embeddings_folder / f"{embedding_name}_embeddings.json"
if embeddings_file.exists():
with open(embeddings_file, "r") as file:
embeddings = [DocumentEmbedding.from_dict(embedding) for embedding in json.load(file)]
print(f"Loaded {len(embeddings)} embeddings from {embeddings_file}")
else:
embeddings = []
for chunk in tqdm(chunks):
text_embedding = embedding_function([chunk.page_content])[0]
embedding = DocumentEmbedding(
document=chunk,
text_embedding=text_embedding
)
embeddings.append(embedding)
with open(embeddings_file, "w") as file:
json.dump([embedding.to_dict() for embedding in embeddings], file, indent=4)
print(f"Saved {len(embeddings)} embeddings to {embeddings_file}")
return embeddings
embeddings = {}
for embedding_name, embedding_function in chroma_embedding_functions.items():
for splitter_name, splitter_chunks in chunks.items():
embeddings[f"{embedding_name}_{splitter_name}"] = get_or_create_embeddings(
embedding_function, splitter_chunks, f"{embedding_name}_{splitter_name}"
)
Loaded 25399 embeddings from data\silver\embeddings\mini_recursive_256_embeddings.json Loaded 5754 embeddings from data\silver\embeddings\mini_recursive_1024_embeddings.json Loaded 3144 embeddings from data\silver\embeddings\mini_semantic_embeddings.json Loaded 25399 embeddings from data\silver\embeddings\bge-m3_recursive_256_embeddings.json Loaded 5754 embeddings from data\silver\embeddings\bge-m3_recursive_1024_embeddings.json Loaded 3144 embeddings from data\silver\embeddings\bge-m3_semantic_embeddings.json Loaded 25399 embeddings from data\silver\embeddings\gte_recursive_256_embeddings.json Loaded 5754 embeddings from data\silver\embeddings\gte_recursive_1024_embeddings.json Loaded 3144 embeddings from data\silver\embeddings\gte_semantic_embeddings.json
The number of embeddings corresponds to the number of chunks produced by each chunking strategy, not to the embedding dimension. The smaller chunk size (256) therefore yields more chunks than the larger one (1024), and semantic chunking yields the fewest.
As mentioned above, for our semantic search retrieval we will store the embeddings in ChromaDB. ChromaDB is a powerful tool for indexing and searching high-dimensional data. Among other things, it supports approximate nearest neighbor (ANN) search based on the Hierarchical Navigable Small World (HNSW) algorithm, which is known for its efficiency in high-dimensional spaces.
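To make concrete what HNSW approximates, here is exact (brute-force) cosine retrieval over a handful of hand-made toy vectors; ChromaDB's ANN index produces the same kind of ranking without scanning every stored vector:

```python
import numpy as np

# Toy 4-dimensional "chunk" embeddings, made up purely for illustration.
chunk_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([0.9, 0.1, 0.0, 0.0])

def cosine_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> np.ndarray:
    """Exact nearest neighbours by cosine distance - the operation HNSW approximates."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dists = 1.0 - v @ q            # cosine distance to every stored vector
    return np.argsort(dists)[:k]   # indices of the k closest vectors

print(cosine_top_k(query, chunk_vectors))
# -> [0 1]: the two vectors pointing in nearly the same direction as the query
```

Brute force scales linearly with collection size; HNSW trades a small amount of recall for roughly logarithmic search time, which is why it is the default choice for vector stores.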
Much like a traditional SQL database, ChromaDB has a storage backend that we connect to through a client; here it is a local, SQLite-backed persistent store rather than a standalone server. Using the client, we create a separate index for each set of embeddings, which can be thought of as its own vector space. ChromaDB calls these separate vector spaces "collections". These collections will then be used to search for the chunks most relevant to a user query.

gold_folder = data_folder / "gold"
if not gold_folder.exists():
gold_folder.mkdir()
chromadb_folder = gold_folder / "chromadb"
if not chromadb_folder.exists():
chromadb_folder.mkdir()
chroma_client = chromadb.PersistentClient(path=chromadb_folder.as_posix())
Again we can make use of preprocessed data as before to speed up the preparatory steps.
def get_or_create_collection(
name: str,
embedding_function: EmbeddingFunction,
embeddings: List[DocumentEmbedding],
batch_size: int = 128
) -> Collection:
collection = chroma_client.get_or_create_collection(
name=name,
# configure to use cosine distance not default L2
metadata={"hnsw:space": "cosine"},
embedding_function=embedding_function
)
if collection.count() == 0:
for i in tqdm(range(0, len(embeddings), batch_size)):
batch = embeddings[i:i+batch_size]
collection.add(
documents=[embedding.document.page_content for embedding in batch],
embeddings=[embedding.text_embedding for embedding in batch],
ids=[str(embedding.document.metadata["id"]) for embedding in batch],
metadatas=[embedding.document.metadata for embedding in batch]
)
return collection
collections = {}
for collection_name, current_embeddings in embeddings.items():
collection = get_or_create_collection(
collection_name,
chroma_embedding_functions[collection_name.split("_")[0]],
current_embeddings
)
collections[collection_name] = collection
print(f"Collection {collection_name} has {collection.count()} documents")
Collection mini_recursive_256 has 25399 documents Collection mini_recursive_1024 has 5754 documents Collection mini_semantic has 3144 documents Collection bge-m3_recursive_256 has 25399 documents Collection bge-m3_recursive_1024 has 5754 documents Collection bge-m3_semantic has 3144 documents Collection gte_recursive_256 has 25399 documents Collection gte_recursive_1024 has 5754 documents Collection gte_semantic has 3144 documents
The above printout shows the three embedding models applied to the three chunking strategies.
Once all the embeddings are stored in ChromaDB, we can test the retrieval process by querying one of our collections. Try a few different queries and check whether the most similar chunks actually make sense.
selected_collection = collections["gte_recursive_1024"]
results = selected_collection.query(
query_texts=["Climate Change"],
n_results=3,
)
for doc in results["documents"][0]:
print(wrap_text(doc))
print()
Climate Change Archives - Page 5 of 63: Southern countries are pushing hard to make transparent the wealth and climate consequences of burning fossil fuels. Bill McKibben says it's clear how impeachably... While I watched the chilled host on the Macy's Day Parade television broadcast talk about Tofurky as a vegan Thanksgiving substitute, I can't say... A turkey is a symbol of US Thanksgiving dinner traditions. But how do you make flexitarians -- guests who prefer vegetarian or vegan eating... For the first time ever, formal discussions took place at the annual climate convention about food security. The consensus is that, in order to... The new Chris Hemsworth project "Limitless" is the perfect antidote to climate doomerism ( with bonus energy storage angle, of course). Food security threatens many regions around the world. Puerto Rico's decades of dependence on outside food imports has impacted the health and resilience of... Engineers working on hydrogen, evtols, UAM, vertiports, hypersonic passenger scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. "We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century," said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation.
To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the Potential Climatic Impact of Nord Stream Methane Leaks: Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences' Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental
To gain a better understanding of how the retrieval process works, we will analyze the embedding space. We start by projecting the embeddings into a 2D space using UMAP, a dimensionality reduction technique that is particularly well suited for visualizing high-dimensional data in a lower-dimensional space. Its most notable advantages over other dimensionality reduction techniques are speed and better preservation of the data's global structure. We will then use the UMAP projections to create a scatter plot of the chunks.
def get_vectors_from_collection(collection: Collection):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    return np.array(stored_chunks["embeddings"])

def get_vectors_by_domain(collection: Collection, domain: str):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    metadatas = stored_chunks["metadatas"]
    indices = [str(metadata["id"]) for metadata in metadatas if metadata["domain"] == domain]
    return collection.get(include=["embeddings"], ids=indices)["embeddings"]

def fit_umap(vectors: np.ndarray):
    return umap.UMAP().fit(vectors)

def project_embeddings(embeddings, umap_transform):
    return umap_transform.transform(embeddings)
vectors = get_vectors_from_collection(selected_collection)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (5754, 768) Projected shape: (5754, 2)
The shapes above show how the 768-dimensional chunk embeddings are reduced to two dimensions for visualization purposes.
You can zoom into the plot by clicking and dragging a box around the area of interest. You can reset the view by double-clicking on the plot.
fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show()
Next we will color the embeddings by the domain of the article to see if there are any patterns or clusters in the embedding space based on the domain.
fig = go.Figure()
for domain in sample_df["domain"].unique():
    domain_vectors = get_vectors_by_domain(selected_collection, domain)
    domain_projections = project_embeddings(domain_vectors, umap_transform)
    fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=domain))
fig.show()
We can also visualize the retrieval process by plotting the query and the most similar chunks in the embedding space. This will give us a better understanding of how the retrieval process works and how the most similar chunks are found.
Note that UMAP's projection is based on a different distance metric than the approximate nearest neighbor search used for retrieval. Also, don't forget that the embeddings live in a high-dimensional space and we are only visualizing a 2D projection of them, so the distances between points in the plot may be misleading. Try some different queries and see how the most similar chunks are found.
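This distortion can be quantified directly. The sketch below is NumPy-only and uses a plain SVD (linear) projection as a stand-in for UMAP, which is an assumption for illustration: for each random point it compares the top-k neighbors under cosine distance in the original space with the top-k neighbors under Euclidean distance in 2D. The overlap is typically well below 100%, i.e. the 2D neighborhoods do not faithfully reproduce the high-dimensional ones.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 64))  # stand-in for 64-dim chunk embeddings

# Project to 2D via truncated SVD (a linear stand-in for UMAP).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # shape (200, 2)

def top_k_cosine(X, i, k=5):
    """Indices of the k nearest neighbors of row i under cosine distance."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    d = 1.0 - Xn @ Xn[i]
    d[i] = np.inf  # exclude the point itself
    return set(np.argsort(d)[:k])

def top_k_euclidean(X, i, k=5):
    """Indices of the k nearest neighbors of row i under Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return set(np.argsort(d)[:k])

k = 5
overlaps = [len(top_k_cosine(X, i, k) & top_k_euclidean(X2, i, k)) for i in range(len(X))]
print(f"mean neighbor overlap: {np.mean(overlaps):.2f} of {k}")
```

With real embeddings and UMAP the overlap tends to be better than for this random data, but the same caveat applies: nearest neighbors in the plot are not necessarily nearest neighbors in the retrieval space.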
def plot_retrieval_results(
    query: str,
    selected_collection: Collection,
    n_results: int = 5
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    # embed the query with the collection's own embedding function
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    nearest_neighbors = selected_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    neighbor_vectors = selected_collection.get(include=["embeddings"], ids=nearest_neighbors["ids"][0])["embeddings"]
    neighbor_projections = project_embeddings(neighbor_vectors, umap_transform)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=neighbor_projections[:, 0], y=neighbor_projections[:, 1], mode='markers', marker=dict(size=5, color='orange'), name="nearest neighbors"))
    fig.add_trace(go.Scatter(x=query_projection[:, 0], y=query_projection[:, 1], mode='markers', marker=dict(size=10, color='red', symbol='x'), name="query"))
    fig.show()
plot_retrieval_results(
    "Climate Change",
    selected_collection,
)
Lastly, we will analyze the distribution of the cosine distances between the query and the chunks. This gives us a better understanding of the cosine distance and shows that the distances in the high-dimensional space are not the same as in the 2D projection. Do not confuse the cosine distance with the cosine similarity: the cosine similarity is the cosine of the angle between two vectors, while the cosine distance is 1 minus the cosine similarity, so that smaller values mean the vectors are more similar.
def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity
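As a quick numeric sanity check of this definition (the block re-defines the function so it runs standalone): a vector has distance 0 to any positive multiple of itself, 1 to an orthogonal vector, and 2 to its negation.

```python
import numpy as np

def cosine_distance(v1, v2):
    # 1 minus cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite
    return 1 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

v = np.array([1.0, 0.0])
print(cosine_distance(v, np.array([2.0, 0.0])))   # 0.0 (same direction, scale is ignored)
print(cosine_distance(v, np.array([0.0, 3.0])))   # 1.0 (orthogonal)
print(cosine_distance(v, np.array([-1.0, 0.0])))  # 2.0 (opposite)
```

Note that the distance is insensitive to vector magnitude, which is why it is a common choice for comparing text embeddings.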
def plot_cosine_distances(
    query: str,
    selected_collection: Collection
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    # distances are computed in the original high-dimensional space, not in the 2D projection
    distances = np.array([cosine_distance(query_embedding, vector) for vector in vectors])
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=vectors_projections[:, 0],
        y=vectors_projections[:, 1],
        mode='markers',
        marker=dict(
            size=5,
            color=distances.flatten(),
            colorscale='RdBu',
            colorbar=dict(title='Cosine Distance')
        ),
        text=['Cosine Distance: {:.4f}'.format(dist) for dist in distances.flatten()],
        name='Other Vectors'
    ))
    fig.add_trace(go.Scatter(
        x=[query_projection[0][0]],
        y=[query_projection[0][1]],
        mode='markers',
        marker=dict(size=10, color='black', symbol='x'),
        text=['Query Vector'],
        name='Query Vector'
    ))
    fig.show()
plot_cosine_distances(
    "Climate Change",
    selected_collection,
)
Now that we have generated the embeddings and stored them in ChromaDB, we can put it all together and create the RAG pipeline. The RAG pipeline consists of the following steps: embed the user's question, retrieve the most similar chunks from the vector store, insert them into the prompt as context, and let the LLM generate the answer.
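Stripped of any framework, a minimal RAG loop fits in a few lines. In the toy sketch below, bag-of-words vectors and cosine similarity stand in for a real embedding model and vector store, and the corpus and all names are illustrative; only the assembled prompt at the end would be sent to the LLM.

```python
import numpy as np
from collections import Counter

corpus = [
    "Solar power is the cheapest form of electricity today.",
    "Methane is a potent greenhouse gas released by pipeline leaks.",
    "Sodium-ion batteries are a challenger to lithium-ion cells.",
]

vocab = sorted({w for doc in corpus for w in doc.lower().split()})

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding; a real pipeline would call an embedding model.
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

doc_vectors = [embed(doc) for doc in corpus]

def retrieve(query: str, k: int = 1) -> list:
    # Rank documents by cosine similarity to the query and return the top k.
    q = embed(query)
    sims = [np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9) for d in doc_vectors]
    top = np.argsort(sims)[::-1][:k]
    return [corpus[i] for i in top]

question = "Which batteries challenge lithium-ion?"
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The real pipeline below does exactly this, but with dense embeddings, an ANN index, and an actual model call.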
In this notebook we will be using Langchain to build up our pipeline. You do not need a library like Langchain or LlamaIndex to build a RAG pipeline, but it can make the process easier.
The idea of Langchain and its LCEL (Langchain Expression Language) is very simple. Within the pipeline there are lots of steps that take an input and produce an output. These steps can be chained together to form a pipeline. The LCEL is a simple language that allows you to define these steps and how they are connected. For more technical details on how Langchain works check out the Langchain Documentation.
In simple terms, Langchain provides an abstraction of a step that exposes an invoke method: it takes an input (typically a dictionary of parameters) and returns an output (also typically a dictionary). This allows you to chain different steps together, define how they are connected, and split chains of steps off into separate pipelines.
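To make the invoke/chaining contract concrete, here is a toy re-implementation of the pattern. These are not Langchain's actual classes; the `Step` class and the three stand-in steps are purely illustrative.

```python
class Step:
    """Minimal stand-in for a Langchain Runnable: wraps a function, supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # Chaining: the output of this step becomes the input of the next.
        return Step(lambda value: other.invoke(self.invoke(value)))

# Three toy steps standing in for retriever -> prompt template -> model.
retrieve = Step(lambda q: {"question": q, "context": "solar is cheap"})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: f"Answer based on: {prompt.splitlines()[0]}")

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("Why is solar popular?"))
# -> Answer based on: Context: solar is cheap
```

Langchain's LCEL adds a lot on top of this (parallel branches, streaming, async), but the core idea of composing `invoke`-able steps with `|` is the same.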
Below you can see an overview of our RAG pipeline:

And now let's look at the implementation of the RAG pipeline.
def create_qa_chain(retriever: BaseRetriever):
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. Keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
    rag_prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = RunnableParallel(
        {
            "context": retriever,
            "question": RunnablePassthrough()
        }
    ).assign(answer=(
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | rag_prompt
        | llm
        | StrOutputParser()
    ))
    return rag_chain
For Langchain to work with our ChromaDB collections, we need to wrap the collections in the formats Langchain can work with: so-called stores and retrievers.
def collection_to_store(collection_name: str, lc_embedding_model: EmbeddingFunction):
    return Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=lc_embedding_model,
    )

def store_to_retriever(store: VectorStore, k: int = 3):
    retriever = store.as_retriever(
        search_type="similarity", search_kwargs={'k': k}
    )
    return retriever
selected_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
selected_retriever = store_to_retriever(selected_store)
selected_retriever.invoke("Climate Change")
[Document(page_content="Climate Change Archives - Page 5 of 63: Southern countries are pushing hard to make transparent the wealth and climate consequences of burning fossil fuels. Bill McKibben says it's clear how impeachably... While I watched the chilled host on the Macyβ s Day Parade television broadcast talk about Tofurky as a vegan Thanksgiving substitute, I canβ t say... A turkey is a symbol of US Thanksgiving dinner traditions. But how do you make flexitarians -- guests who prefer vegetarian or vegan eating... For the first time ever, formal discussions took place at the annual climate convention about food security. The consensus is that, in order to... The new Chris Hemsworth project `` Limitless '' is the perfect antidote to climate doomerism ( with bonus energy storage angle, of course). Food security threatens many regions around the world. Puerto Rico's decades of dependence on outside food imports has impacted the health and resilience of... Engineers working on hydrogen, evtols, UAM, vertiports, hypersonic passenger", metadata={'domain': 'cleantechnica', 'id': 2262, 'title': 'Climate Change Archives - Page 5 of 63', 'url': 'cleantechnica.com/tag/climate-change/page/5'}),
Document(page_content="scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. `` We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the", metadata={'domain': 'azocleantech', 'id': 482, 'title': 'Global Warming Could Trigger Chemical Changes in the Ocean Surface that Accelerate Climate Change', 'url': 'azocleantech.com/news.aspx?newsID=33053'}),
Document(page_content='Potential Climatic Impact of Nord Stream Methane Leaks: Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciencesβ Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental', metadata={'domain': 'azocleantech', 'id': 463, 'title': 'Potential Climatic Impact of Nord Stream Methane Leaks', 'url': 'azocleantech.com/news.aspx?newsID=32568'})]
Now that we have our retriever we can create our RAG pipeline. Try some different queries and see how the pipeline responds.
selected_chain = create_qa_chain(selected_retriever)
selected_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talentβwere lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A β Normal β Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing whatβ s known as the β vapor pressure deficit, β or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isnβ t the only factor behind the westβ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A β Normal β Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This yearβ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A β Normal β Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
chains = {}
for collection_name, collection in collections.items():
    store = collection_to_store(collection_name, embedding_models[collection_name.split("_")[0]])
    retriever = store_to_retriever(store)
    chain = create_qa_chain(retriever)
    chains[collection_name] = chain
chains.keys()
dict_keys(['mini_recursive_256', 'mini_recursive_1024', 'mini_semantic', 'bge-m3_recursive_256', 'bge-m3_recursive_1024', 'bge-m3_semantic', 'gte_recursive_256', 'gte_recursive_1024', 'gte_semantic'])
Because we have many hyperparameters to tune (such as chunk size and prompts) and different strategies to try, we will use the RAGAS (RAG Assessment) framework to evaluate our pipeline. RAGAS allows you to evaluate a RAG pipeline with an LLM as a judge, alongside other metrics that utilize embedding models. We will go into more detail on the metrics later on.
Before we can start the evaluation, we need to define the evaluation questions and their ground-truth answers. For this we will use the provided evaluation questions. To increase our question pool, we will also generate additional question-answer pairs by giving the LLM (GPT-4o) a random chunk and asking it to generate a question and answer from it.
human_eval_df.head()
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... |
| 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 3 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... |
As we are only given questions and the relevant sections of the articles we need to generate the answers to the questions. We will use the LLM (GPT-4o) to generate the answers to the questions.
def generate_eval_answers(df: pd.DataFrame) -> pd.DataFrame:
    answer_generation_prompt = """Answer the following question based on the article:
Question: {question}
Article: {article}
"""
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for i, row in tqdm(df.iterrows(), total=len(df)):
        df.at[i, "ground_truth"] = answer_generation_chain.invoke({"question": row["question"], "article": row["relevant_section"]}).content
    return df
if (silver_folder / "human_eval.csv").exists():
    human_eval_df = pd.read_csv(silver_folder / "human_eval.csv")
else:
    human_eval_df = generate_eval_answers(human_eval_df)
    human_eval_df.to_csv(silver_folder / "human_eval.csv", index=False)
human_eval_df.head()
| question | relevant_section | url | ground_truth | |
|---|---|---|---|---|
| 0 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... | The innovation behind LeclanchΓ©'s new method t... |
| 1 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The EUβs Green Deal Industrial Plan is an init... |
| 2 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... | The EUβs Green Deal Industrial Plan is aimed a... |
| 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The four focus areas of the EU's Green Deal In... |
| 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... | The cooperation between GM and Honda on fuel c... |
We will now generate synthetic question-answer pairs: we give the LLM a random chunk and ask it to generate a question and a corresponding answer based on that chunk.
def generate_synthetic_qa_pairs(documents: List[Document], n: int = 10) -> pd.DataFrame:
    synthetic_questions = []
    documents = np.random.choice(documents, n)
    question_generation_prompt = """Generate a short and general question based on the following news article:
Article: {article}
"""
    question_generation_chain = ChatPromptTemplate.from_template(question_generation_prompt) | llm
    answer_generation_prompt = """Answer the following question based on the article:
Question: {question}
Article: {article}
"""
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for document in tqdm(documents):
        element = {}
        content = document.page_content
        element["relevant_section"] = content
        element["url"] = document.metadata["url"]
        question = question_generation_chain.invoke({"article": content}).content
        element["question"] = question
        answer = answer_generation_chain.invoke({"question": question, "article": content}).content
        element["ground_truth"] = answer
        synthetic_questions.append(element)
    return pd.DataFrame(synthetic_questions)
if not (silver_folder / "synthetic_eval.csv").exists():
    synthetic_eval_df = generate_synthetic_qa_pairs(chunks["recursive_1024"], 25)
    synthetic_eval_df.to_csv(silver_folder / "synthetic_eval.csv", index=False)
else:
    synthetic_eval_df = pd.read_csv(silver_folder / "synthetic_eval.csv", index_col=0)
synthetic_eval_df.head()
| url | question | ground_truth | |
|---|---|---|---|
| relevant_section | |||
| for hybrid and fully battery electric vehicles, aiming to bring the industry closer to achieving the key tipping points for mainstream electric vehicle ( EV) adoption. Castrol has been working on advanced EV fluids designed to manage temperatures within Li-ON cells to enable ultra-fast charging and better efficiency. Meanwhile, in its so-called Big Battery Challenge, the UKβ s Institute of Mechanical Engineering ( IMechE) experts have determined that, while it is likely the Li-ON battery will dominate for the time being, β there are plenty of potential long-term challengers β. Three contenders are especially identified: sodium ion ( Na-ON), solid state and Lithium-sulphur ( Li-S). Sodium-ion batteries are regarded as an emerging technology with β promising cost, safety, sustainability and performance advantages β over commercialised lithium-ion batteries. According to IMechE material: β Key advantages include the use of widely available and inexpensive raw materials and a rapidly scaleable technology based | energyvoice.com/technology/446761/batteries-te... | What are the potential long-term challengers t... | The potential long-term challengers to lithium... |
| and content of the checkpoint. Using the feedback, the checkpoint will then be established as a new measure to assess potential future licences. It will ensure any future licences are granted on the basis that they are compatible with the UKβ s goal to become net zero by 2050. If the evidence suggests that a future licensing round would undermine progress towards that target, it would not go ahead, UK Government said. The new checkpoint will add an additional layer of scrutiny to future licences, on top of the existing measures that already apply to UK oil and gas developments. Operators currently have to adhere to regulations enforced by the Offshore Petroleum Regulator for Environment and Decommission ( OPRED), as well as the net zero impact assessment carried out by the OGA as part of its consent process for new licences. Malcolm Offord, UK Government minister for Scotland, said: β The UK Government fully supports the oil and gas industry in its transition away from fossil fuels to cleaner, greener energy | energyvoice.com/oilandgas/north-sea/374073/uk-... | How will the new checkpoint affect the process... | The new checkpoint will affect the process of ... |
| France ( 103 days) and the Netherlands ( 123 days). Centrica said it had completed β significant engineering upgrades β over the summer and in August was given the go-ahead by the offshore regulator North Sea Transition Authority ( NSTA) to reopen the site. This was followed by commissioning over the early autumn, enabling it to make its first injection of gas into the site in over five years. The work done so far means that Rough is operating at around 20% of its previous capacity this winter, immediately making it the UKβ s largest gas storage site once again and adding 50% to the UKβ s gas storage volume. The operator now says its long-term aim is to turn the Rough gas field into β the largest long duration energy storage facility in Europe β, capable of storing both natural gas and hydrogen β a major turnaround in fortunes for the previously mothballed site. Centrica group chief executive Chris Oβ Shea said: β Iβ m delighted that we have managed to return Rough to storage operations for this winter | energyvoice.com/oilandgas/north-sea/455701/cen... | What are the implications of reopening the Rou... | Reopening the Rough gas field has significant ... |
| Going underground: how solar sites can boost biodiversity: The UKβ s biodiversity crisis stands in the shadow of our energy price crisis β but both challenges can be addressed through renewable energy. Mark Rowcroft, Development Director at solar and battery storage developer Exagen, explains how reaching our full solar energy potential means looking not only to the skies, but to the soil. Boosting UK renewable energy is a key route to tackling the energy price crisis, with solar power the cheapest form of electricity today. The Prime Ministerβ s COP27 speech reaffirmed his commitment to clean energy and, with the UK targeting 70GW of solar generation by 2035, huge potential exists to grow solar generation. Yet misconceptions stubbornly remain: such as the argument frequently presented by opponents of solar farms that they β industrialise the land β, without realising the extent to which UK farmland is already industrialised. A common misconception is that UK farmland is bursting with wildlife and | sgvoice.energyvoice.com/policy/18796/going-und... | How can solar energy sites contribute to addre... | Solar energy sites can contribute to addressin... |
| solar manufacturer Kaneka as a supplier for solar cell deployment in one of its electric vehicles. Kaneka's solar cels have been for years recognized as the most efficient crystalline silicon PV device developed at both the industry and research levels. However, Chinese manufacturer Longi said last November that it had crossed reached a power conversion efficiency of 26.81% with an unspecified heterojunction ( HJT) solar cell, based on a full-size silicon wafer, in mass production. This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com. Please be mindful of our community standards. Your email address will not be published. Required fields are marked * Save my name, email, and website in this browser for the next time I comment. By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment. Your personal data will only be disclosed or | pv-magazine.com/2023/07/03/enecoat-toyota-deve... | What advancements have been made in solar cell... | Companies like Kaneka and Longi have made sign... |
question_length = {
    "human": human_eval_df["question"].map(len),
    "synthetic": synthetic_eval_df["question"].map(len)
}
sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df = pd.concat([human_eval_df, synthetic_eval_df], ignore_index=True)
eval_df["is_synthetic"] = eval_df["relevant_section"].isna()
eval_df["is_synthetic"].value_counts()
is_synthetic True 25 False 23 Name: count, dtype: int64
Now we have roughly doubled the number of question-answer pairs. However, our synthetic questions are slightly longer than the provided ones, which could mean they are more specific and therefore slightly easier to answer. This potential bias should be taken into account when evaluating the pipeline.
RAGAS provides a variety of metrics to evaluate the performance of a RAG pipeline. Here are some of the key metrics we will be using and how they are calculated:

For this to work we create a test dataset for each of our RAG pipelines that contains the evaluation questions and their ground truth answers. We then run all the questions through our RAG pipeline and store the generated answers and the retrieved chunks. We can then use this test dataset to calculate the RAGAS metrics.
datasets_folder = gold_folder / "datasets"
if not datasets_folder.exists():
    datasets_folder.mkdir()

def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain: Chain) -> Dataset:
    dataset_file = datasets_folder / f"{name}_dataset.json"
    if dataset_file.exists():
        with open(dataset_file, "r") as file:
            dataset = Dataset.from_dict(json.load(file))
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": []
        }
        for question in tqdm(datapoints["question"]):
            result = chain.invoke(question)
            datapoints["answer"].append(result["answer"])
            datapoints["contexts"].append([str(doc.page_content) for doc in result["context"]])
            datapoints["context_urls"].append([doc.metadata["url"] for doc in result["context"]])
        dataset = Dataset.from_dict(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(dataset.to_dict(), file)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
results_folder = gold_folder / "results"
if not results_folder.exists():
    results_folder.mkdir()

def get_or_run_llm_eval(name: str, dataset: Dataset, llm_judge_model: LLM) -> pd.DataFrame:
    eval_results_file = results_folder / f"{name}_llm_eval_results.csv"
    if eval_results_file.exists():
        eval_results = pd.read_csv(eval_results_file)
        print(f"Loaded {name} evaluation results from {eval_results_file}")
    else:
        eval_results = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy, context_relevancy, answer_correctness],
            is_async=True,
            llm=llm_judge_model,
            embeddings=embedding_models["gte"],
            run_config=RunConfig(timeout=60, max_retries=10, max_wait=60, max_workers=8),
        ).to_pandas()
        eval_results.to_csv(eval_results_file, index=False)
        print(f"Saved {name} evaluation results to {eval_results_file}")
    return eval_results
def plot_llm_eval(name: str, eval_results: pd.DataFrame):
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = eval_results.select_dtypes(include=[np.float64])
    # boxplot of distributions
    sns.boxplot(data=ragas_metrics_data, palette="Set2")
    plt.title(f'{name}: Distribution of RAGAS Evaluation Metrics')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.mean()
    plt.figure(figsize=(14, 8))
    sns.barplot(x=means.index, y=means, palette="Set2")
    plt.title(f'{name}: Mean of RAGAS Evaluation Metrics')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
def plot_multiple_evals(eval_results: Dict[str, pd.DataFrame]):
    # combine the results
    full_results = []
    for name, results in eval_results.items():
        results['name'] = name
        full_results.append(results)
    full_results = pd.concat(full_results, ignore_index=True)
    full_results = full_results.sort_values(by='name')
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data['name'] = full_results['name']
    # boxplot of distributions
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='variable', y='value', hue='name', data=pd.melt(ragas_metrics_data, id_vars='name'), palette="Set2")
    plt.title('Distribution of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.groupby('name').mean().reset_index()
    means_melted = pd.melt(means, id_vars='name')
    plt.figure(figsize=(14, 8))
    sns.barplot(x='variable', y='value', hue='name', data=means_melted, palette="Set2")
    plt.title('Mean of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
selected_dataset = get_or_create_eval_dataset("selected", eval_df, selected_chain)
Loaded selected dataset from data\gold\datasets\selected_dataset.json
As a judge we use the GPT-4o-mini model, a smaller version of the GPT-4o model. While it is not as powerful as the full GPT-4o model, it is still very capable and lets us evaluate the performance of our RAG pipeline without incurring high costs.
It has also been suggested in the literature that when evaluating LLMs with LLMs as judges, the evaluation is more reliable when the judge is a different model than the model being evaluated. This is because a model might have learned to exploit the weaknesses of the other model, or might be biased towards its own answers. https://arxiv.org/abs/2404.13076
judge = ChatOpenAI(model="gpt-4o-mini")
question_prompt = ChatPromptTemplate.from_template(
"Answer the following question: {question}")
question_chain = question_prompt | judge | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The meaning of life is a philosophical question that has been contemplated by humans for centuries. Different cultures, religions, and individuals have offered various interpretations. Some people find meaning through relationships, love, and connection with others, while others seek purpose through personal achievements, spirituality, or contributing to the greater good.\n\nIn existential philosophy, the meaning of life is often seen as something that each person must define for themselves. This perspective emphasizes personal responsibility and the idea that individuals create their own meaning through their choices and actions. \n\nUltimately, the meaning of life can be deeply personal and subjective, varying greatly from one person to another. It may encompass a combination of experiences, beliefs, values, and aspirations that resonate with an individual's understanding of their existence.'
selected_llm_eval_results = get_or_run_llm_eval("selected", selected_dataset, judge)
plot_llm_eval("selected", selected_llm_eval_results)
Loaded selected evaluation results from data\gold\results\selected_llm_eval_results.csv
datasets = {}
for name, chain in chains.items():
datasets[name] = get_or_create_eval_dataset(name, eval_df, chain)
Loaded mini_recursive_256 dataset from data\gold\datasets\mini_recursive_256_dataset.json
Loaded mini_recursive_1024 dataset from data\gold\datasets\mini_recursive_1024_dataset.json
Loaded mini_semantic dataset from data\gold\datasets\mini_semantic_dataset.json
Loaded bge-m3_recursive_256 dataset from data\gold\datasets\bge-m3_recursive_256_dataset.json
Loaded bge-m3_recursive_1024 dataset from data\gold\datasets\bge-m3_recursive_1024_dataset.json
Loaded bge-m3_semantic dataset from data\gold\datasets\bge-m3_semantic_dataset.json
Loaded gte_recursive_256 dataset from data\gold\datasets\gte_recursive_256_dataset.json
Loaded gte_recursive_1024 dataset from data\gold\datasets\gte_recursive_1024_dataset.json
Loaded gte_semantic dataset from data\gold\datasets\gte_semantic_dataset.json
llm_results = {}
for dataset_name, dataset in datasets.items():
llm_results[dataset_name] = get_or_run_llm_eval(dataset_name, dataset, judge)
Loaded mini_recursive_256 evaluation results from data\gold\results\mini_recursive_256_llm_eval_results.csv
Loaded mini_recursive_1024 evaluation results from data\gold\results\mini_recursive_1024_llm_eval_results.csv
Loaded mini_semantic evaluation results from data\gold\results\mini_semantic_llm_eval_results.csv
Loaded bge-m3_recursive_256 evaluation results from data\gold\results\bge-m3_recursive_256_llm_eval_results.csv
Loaded bge-m3_recursive_1024 evaluation results from data\gold\results\bge-m3_recursive_1024_llm_eval_results.csv
Loaded bge-m3_semantic evaluation results from data\gold\results\bge-m3_semantic_llm_eval_results.csv
Loaded gte_recursive_256 evaluation results from data\gold\results\gte_recursive_256_llm_eval_results.csv
Loaded gte_recursive_1024 evaluation results from data\gold\results\gte_recursive_1024_llm_eval_results.csv
Loaded gte_semantic evaluation results from data\gold\results\gte_semantic_llm_eval_results.csv
plot_multiple_evals(llm_results)
mean_scores = {}
for name, results in llm_results.items():
mean_scores[name] = results.select_dtypes(include=[np.float64]).mean()
total_mean_scores = pd.DataFrame(mean_scores).mean()
total_mean_scores.sort_values(ascending=False)
bge-m3_recursive_1024    0.648580
gte_recursive_1024       0.647773
bge-m3_semantic          0.626282
gte_recursive_256        0.624573
mini_recursive_1024      0.622090
gte_semantic             0.605386
mini_semantic            0.601130
bge-m3_recursive_256     0.588193
mini_recursive_256       0.558627
dtype: float64
From the evaluation we can see that the pipelines using the GTE embedding model by Alibaba or the BGE-M3 model, combined with recursive chunking at a chunk size of 1024, perform best on average across the metrics. This is likely because these embedding models are the most powerful, and a chunk size of 1024 provides the LLM with enough context without distracting it with too much.
best_collection = collections["gte_recursive_1024"]
best_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
In this final section we will look at some more advanced methods to improve our RAG pipeline and compare them to our best-performing pipeline.
Multi-querying is a technique that involves querying the retrieval model with multiple questions instead of one. This approach can enhance retrieval by leveraging the diversity of queries to capture a broader range of relevant information. By combining the results from multiple queries, we can potentially improve the quality of the retrieved chunks and, consequently, the generated responses. When creating these additional queries, the goal is to produce queries that differ from the original query but remain relevant to the user's information need, i.e. variations of the original query.
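The retriever implemented below simply deduplicates the merged results; an alternative way to combine the ranked lists from the query variations is reciprocal rank fusion (RRF). RRF is not used in this notebook, so the following is only an illustrative, self-contained sketch (the constant k=60 is the value commonly suggested in the literature):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked lists of document ids into a single score per id."""
    # Each document earns 1 / (k + rank) from every list it appears in, so ids
    # ranked highly by several query variations accumulate the largest scores.
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Toy example: three query variations with partially overlapping results.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_e"],
]
fused = reciprocal_rank_fusion(rankings)
top = sorted(fused, key=fused.get, reverse=True)
print(top)  # doc_b comes first: all three variations rank it highly
```

The fused scores could then be used to pick the top-k chunks to pass to the LLM instead of keeping every deduplicated chunk.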

def generate_query_variations(query: str, num_additional_queries: int) -> List[str]:
multiquery_prompt = """You are an assistant tasked with generating {num_queries} \
different versions of the given user question to retrieve relevant documents from a vector \
database. By generating multiple perspectives on the user question and breaking it down, your goal is to help \
the user overcome some of the limitations of the distance-based similarity search. \
Provide these alternative questions separated by newlines without any numbering or listing.
Original question: {question}
Alternatives:
"""
multiquery_chain = ChatPromptTemplate.from_template(multiquery_prompt) | llm
return multiquery_chain.invoke({"question": query, "num_queries": num_additional_queries}).content.split("\n")
def plot_multiquery_retrieval_results(query: str, collection : Collection, num_additional_queries: int = 3, num_results: int = 3):
vectors = get_vectors_from_collection(collection)
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
query_variations = generate_query_variations(query, num_additional_queries)
query_variations_projections = project_embeddings(collection._embedding_function(query_variations), umap_transform)
original_relevant_docs = collection.query(
query_texts=[query],
n_results=num_results,
)
original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
additional_relevant_docs = collection.query(
query_texts=query_variations,
n_results=num_results,
)
additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten
# remove duplicates
additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
# remove the original relevant docs from the additional relevant docs
additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)
fig = go.Figure()
fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="query variations"))
fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
fig.show()
plot_multiquery_retrieval_results("Climate Change", selected_collection)
class MultiQueryRetriever(BaseRetriever):
store: VectorStore
num_additional_queries: int = 3
num_results: int = 3
def _get_query_variations(self, query: str) -> List[str]:
return generate_query_variations(query, self.num_additional_queries)
def _get_relevant_documents(
self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
queries = self._get_query_variations(original_query)
queries.append(original_query)
retriever = store_to_retriever(self.store, k=self.num_results)
relevant_docs = []
for query in queries:
results = retriever.invoke(query, run_manager=run_manager)
# remove duplicates
for res in results:
if res not in relevant_docs:
relevant_docs.append(res)
return relevant_docs
multiquery_retriever = MultiQueryRetriever(store=best_store, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what's known as the "vapor pressure deficit," or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn't the only factor behind the west's worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='Blue River, Vida, Phoenix, and Talent - were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other plants that have sprung up as a result of the wet weather could quickly turn into dry kindling for wildfires as the dry season wears on into late summer and fall. According to the latest wildland fire outlook, most of the western United States is expected to experience either normal or below-normal fire activity between May and August this year. Source: National Interagency Fire Center. There are many different ways to measure wildfire activity, but by almost any metric, wildfires across the western US and southwestern Canada are worsening. Reliable, consistent wildfire metrics across the region started to become available in the mid-1980s. Here's what the trends show. From 1984 to 1999, the region experienced an average of roughly 230 fires per year. From 2000 to 2021, the average was more than 350 fires per year. The number of wildfires larger than 1,000 acres in western North', metadata={'domain': 'cleantechnica', 'id': 5655, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year's wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["multiquery"] = get_or_create_eval_dataset("multiquery", eval_df, multiquery_chain)
Loaded multiquery dataset from data\gold\datasets\multiquery_dataset.json
llm_results["multiquery"] = get_or_run_llm_eval("multiquery", datasets["multiquery"], judge)
Loaded multiquery evaluation results from data\gold\results\multiquery_llm_eval_results.csv
strategy_results = {}
strategy_results["gte_recursive_1024"] = llm_results["gte_recursive_1024"]
strategy_results["multiquery"] = llm_results["multiquery"]
plot_multiple_evals(strategy_results)
We can see that on average the answer correctness slightly increases when using multi-querying. This is likely because the retrieval process becomes more robust and captures a broader range of relevant information. The decrease in faithfulness and context_relevancy, however, could be due to multi-querying introducing more noise into the retrieval process: it retrieves more chunks in general, and some of them are less relevant.
The idea of the HyDE method is to generate hypothetical documents that are similar to the user query and then retrieve the chunks most similar to these hypothetical documents. This can be useful when the user query is not very specific or not very similar to the chunks: the generated hypothetical documents tend to resemble the chunks more closely and can therefore improve retrieval. Another way to think about it is that we generate a hypothetical answer and thereby reach an area in the embedding space that is closer to the actual answer, an area which might not be reachable from the user query alone.
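This embedding-space intuition can be illustrated with made-up vectors: a hypothetical answer passage often lies closer to the relevant chunk than the short query does. The 3-d vectors below are invented purely for illustration and have nothing to do with the real embedding models used in this notebook:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d "embeddings" purely for illustration.
query = np.array([1.0, 0.1, 0.0])         # short, underspecified question
hypothetical = np.array([0.8, 0.9, 0.3])  # generated answer-style passage
passage = np.array([0.7, 1.0, 0.4])       # the chunk we actually want to retrieve

print(cosine(query, passage))         # comparatively low similarity
print(cosine(hypothetical, passage))  # higher: the hypothesis lands nearer the chunk
```

In these toy coordinates the hypothetical document is much closer to the target passage than the raw query, which is exactly the effect HyDE exploits.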

def generate_hypothetical_document(query: str, num_hypotheses: int) -> List[str]:
hyde_prompt = """Please write a news passage about the topic.
Topic: {query}
Passage:
"""
hyde_chain = ChatPromptTemplate.from_template(hyde_prompt) | llm
hypothetical_documents = [hyde_chain.invoke({"query": query}).content for _ in range(num_hypotheses)]
return hypothetical_documents
def plot_hyde_retrieval_results(query: str, collection : Collection, num_hypo_documents: int = 2, num_results: int = 3):
vectors = get_vectors_from_collection(collection)
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
hypothetical_documents = generate_hypothetical_document(query, num_hypo_documents)
query_variations_projections = project_embeddings(collection._embedding_function(hypothetical_documents), umap_transform)
original_relevant_docs = collection.query(
query_texts=[query],
n_results=num_results,
)
original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
additional_relevant_docs = collection.query(
query_texts=hypothetical_documents,
n_results=num_results,
)
additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten
# remove duplicates
additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
# remove the original relevant docs from the additional relevant docs
additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)
fig = go.Figure()
fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="hypothetical documents"))
fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
fig.show()
plot_hyde_retrieval_results("Climate Change", selected_collection)
class HyDERetriever(BaseRetriever):
store: VectorStore
num_hypo_documents: int = 2
num_results: int = 3
def _get_hypothetical_documents(self, query: str) -> List[str]:
return generate_hypothetical_document(query, self.num_hypo_documents)
def _get_relevant_documents(
self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
hypothetical_documents = self._get_hypothetical_documents(original_query)
hypothetical_documents.append(original_query)
retriever = store_to_retriever(self.store, k=self.num_results)
relevant_docs = []
for query in hypothetical_documents:
results = retriever.invoke(query, run_manager=run_manager)
# remove duplicates
for res in results:
if res not in relevant_docs:
relevant_docs.append(res)
return relevant_docs
hyde_retriever = HyDERetriever(store=best_store, num_hypo_documents=2, num_results=3)
hyde_chain = create_qa_chain(hyde_retriever)
hyde_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talent - were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what's known as the "vapor pressure deficit," or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn't the only factor behind the west's worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='Let's dive into western wildfires by the numbers. As spring turns to summer and the days warm up, the Northern Hemisphere enters the period known as Danger Season, when wildfires, heat waves, and hurricanes, all amplified by climate change, begin to ramp up. In the western United States, the start of Danger Season is marked by the shift from the wintertime wet season to the summertime dry season. While wildfires can and do occur all year round, this shift from cool and wet to warm and dry marks the start of wildfire season in the region. According to the latest seasonal outlook from the National Interagency Fire Center, the exceptionally rainy and snowy conditions the west experienced during the winter of 2022-2023 are translating to below-average to normal levels of wildfire risk across most western states at least through August. That said, above-normal activity is expected for parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other', metadata={'domain': 'cleantechnica', 'id': 5654, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year's wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["hyde"] = get_or_create_eval_dataset("hyde", eval_df, hyde_chain)
Loaded hyde dataset from data\gold\datasets\hyde_dataset.json
llm_results["hyde"] = get_or_run_llm_eval("hyde", datasets["hyde"], judge)
Loaded hyde evaluation results from data\gold\results\hyde_llm_eval_results.csv
strategy_results["hyde"] = llm_results["hyde"]
plot_multiple_evals(strategy_results)
Just like with multi-querying, we can see that the answer correctness increases when using the HyDE method.
There are many other methods that can be used to improve the RAG pipeline. Some of these include:
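One commonly cited addition is a reranking step: retrieve a generous candidate set from the vector store, then reorder it with a more expensive scorer, typically a cross-encoder, before passing only the top chunks to the LLM. Below is a minimal sketch in which a toy word-overlap scorer stands in for a real cross-encoder; all names here are illustrative and not part of the notebook's pipeline:

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_k: int = 3) -> List[str]:
    """Reorder retrieved chunks by a (query, chunk) relevance score, keep the best top_k."""
    scored: List[Tuple[float, str]] = [(score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# Toy scorer: fraction of query words found in the chunk.
# A real pipeline would call a cross-encoder model here instead.
def overlap_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = ["wildfire smoke exposure rose sharply",
          "solar panel prices keep falling",
          "smoke from wildfires affects air quality"]
print(rerank("wildfire smoke exposure", chunks, overlap_score, top_k=2))
```

The same `rerank` shape works with any scorer, so swapping the toy overlap function for a cross-encoder's predict call is a small change.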
os.system("jupyter nbconvert --to html --template pj cleantech_rag.ipynb")
0